# Case-Control Association Testing in the Presence of Unknown Relationships

^{1}Department of Biostatistics, University of Washington, Seattle, WA, 98195, USA

^{2}Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA

^{3}Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, 98195, USA

## Abstract

Genome-wide association studies result in inflated false positive results when unrecognized cryptic relatedness exists. A number of methods have been proposed for testing association between markers and disease with a correction for known pedigree-based relationships. However, in most case-control studies, relationships are generally unknown, yet the design is predicated on the assumption of at least ancestral relatedness among cases. Here, we focus on adjusting for cryptic relatedness when the genealogy of the sample is unknown, particularly in the context of samples from isolated populations where cryptic relatedness may be problematic. We estimate cryptic relatedness using maximum-likelihood methods and use a corrected chi-square test with estimated kinship coefficients for testing in the context of unknown cryptic relatedness. Estimated kinship coefficients characterize precisely the relatedness between truly related people, but are biased for unrelated pairs. The proposed test substantially reduces spurious positive results, producing a uniform null distribution of p-values. Especially with missing pedigree information, estimated kinship coefficients can still be used to correct non-independence among individuals. The corrected test was applied to real data sets from genetic isolates and produced a distribution of p-values that was close to uniform. Thus the proposed test corrects the non-uniform distribution of p-values obtained with the uncorrected test and illustrates the advantage of the approach on real data.

**Keywords:** Cryptic Relatedness, Kinship Coefficient, Corrected *χ*^{2} test, Type I error, Genome scan

## INTRODUCTION

Genome-wide association studies have been proposed as an efficient and powerful method of uncovering genetic variants that contribute to complex disease [Hirschhorn and Daly, 2005; The Wellcome Trust Case Control Consortium, 2007]. Especially when genes have modest effects on disease risk and have common risk allele frequencies, association studies are believed to be more powerful than linkage studies, which are widely used for detecting genes of a major effect [Risch and Merikangas, 1996; Cardon and Bell, 2001; Carlson et al., 2004]. However, association testing can lead to spurious positive results when unrecognized population structure exists [Hirschhorn and Daly, 2005; Cardon and Bell, 2001]. This has motivated the development and use of robust association methods to correct for effects of population structure caused by stratification, including the Transmission Disequilibrium Test (TDT) [Spielman et al., 1993; Ewens and Spielman, 1995], and generalizations implemented in the Family Based Association Test (FBAT) [Laird et al., 2000], even though such family-based designs diminish statistical efficiency. In population-based association studies, several statistical methods have also been developed to detect population stratification and to account for its effect on testing [Pritchard and Rosenberg, 1999; Devlin and Roeder, 1999; Bacanu et al., 2000; Shmulewitz et al., 2004; Price et al., 2006; Zang et al., 2007].

Another way to avoid effects of population stratification is to use population isolates. With a small number of founders and long isolation, population isolates have long been exceptional resources for genetic studies of simple genetic diseases. Researchers have also argued that population isolates provide advantages for mapping complex traits [Wright et al., 1999; Peltonen et al., 2000; Peltonen, 2000; Shifman and Darvasi, 2001; Escamilla, 2001; Venken and Del-Favero, 2007]. In addition to reducing genetic heterogeneity, such isolates have more homogeneous environmental backgrounds than are typical in larger heterogeneous populations [Peltonen, 2000; Shifman and Darvasi, 2001]. With less population stratification, Hardy-Weinberg Equilibrium (HWE) is more likely to hold in population isolates [Peltonen et al., 2000]. These factors increase the validity of testing for differences in allele frequencies between cases and controls. However, a challenge to the use of population isolates is that classical statistical methods for ideal population data, which consist of independent individuals, may not be applicable: an important feature of population isolates is that the relatedness of two random individuals may be non-negligible [Bourgain and Genin, 2005]. Such cryptic relatedness may also be found in outbred populations as a source of population structure, and cryptic relatedness among affected individuals might be another serious source of spurious positive results [Devlin and Roeder, 1999; Bacanu et al., 2000] because cases share a genetic disorder [Bourgain and Genin, 2005; Voight and Pritchard, 2005].

There are a number of statistical methods that are designed to overcome this problem. When the entire genealogy of the sample is known, one suggested approach is to use the Armitage trend test with a correction factor that is computed conditional on the marker genotypes and pedigree structure [Slager and Schaid, 2001]. With the same idea, the standard *χ*^{2} test for comparing allele frequencies between cases and controls can be modified with a correction factor to account for relatedness among individuals (corrected *χ*^{2} test) [Bourgain et al., 2003]. This correction factor depends only on kinship and inbreeding coefficients derived directly from the pedigree information. While a modified Armitage trend test is not applicable to complicated pedigree structures, the corrected *χ*^{2} test can be used for any population structure for which kinship coefficients among pairs of individuals and inbreeding coefficients of every individual are known [Bourgain et al., 2003]. Furthermore, the *χ*^{2} test can be easily performed for multiallelic data, including the presence of rare alleles. To improve power, a quasi-likelihood score (QLS) test has also been proposed [Bourgain et al., 2003]. The QLS test is derived from the quasi-likelihood framework and also accounts for the correlations among individuals. However, the QLS test can lead to negative values for estimated allele frequencies, particularly when alleles are rare, thus making it less applicable to general situations than the corrected *χ*^{2} test. Finally, all such proposed corrections for relatedness to date assume knowledge of the relationships in the sample.

Since relationships will be unknown in most case-control studies, it is useful to consider approaches that do not depend on the existence of known relationships. Genetic relatedness among individuals can be estimated if genetic markers are available. Under the assumption that the loci are unlinked, several methods for estimating relatedness have been developed, including traditional maximum-likelihood estimation [Thompson, 1975] and other estimators based on method-of-moments approaches [Ritland, 1996; Lynch and Ritland, 1999; Wang, 2002]. Compared to other estimators, maximum-likelihood estimators of relatedness exhibit the desirable features of having consistently lower standard errors and being adaptable to many sampling situations [Milligan, 2003; Thompson, 1975].

In this paper, our goals are (1) to characterize the effect of cryptic relatedness on testing in case-control samples, particularly from an isolated population, (2) to estimate cryptic relatedness with such samples, and (3) to develop an approach for testing for association with a correction for cryptic relatedness. To test for association between markers and disease and to simultaneously account for unknown relatedness among individuals, we propose to modify the corrected *χ*^{2} test by using estimated, rather than assigned, coefficients of relatedness. We focus on the corrected *χ*^{2} test because of its desirable statistical behavior, ease of computation, and applicability to a variety of marker and data types. We perform simulations to evaluate the statistical properties of the corrected *χ*^{2} test for use in the context of case-control data, and we illustrate the advantage of the proposed method by application to two samples from genetic isolates.

## METHODS

### RELATIONSHIP ESTIMATION

The simplest summary of the degree of a pairwise relationship is the kinship coefficient, *ϕ*, which is the probability that a pair of homologous alleles chosen from two individuals are identical by descent (IBD). In this section, we briefly describe how to estimate *ϕ* by estimating the set of three *k*-coefficients in a non-inbred population [Crow and Kimura, 1970].

The three *k*-coefficients, *k*_{0}, *k*_{1} and *k*_{2}, are defined as the probabilities that a pair of individuals share neither, one or both, respectively, of their two alleles at a locus IBD:

$k_i = P(X = i), \quad i = 0, 1, 2,$

where *k*_{i} ≥ 0, *k*_{0} + *k*_{1} + *k*_{2} = 1 and *X* is the number of alleles shared IBD. The kinship coefficient can be computed as *ϕ* = 0.5 *k*_{2} + 0.25 *k*_{1}.
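As a quick check of the kinship formula, the *k*-coefficients of a few non-inbred relationships can be substituted directly. The *k* values below are the usual textbook values and are stated here as an assumption, not copied from the tables of this paper:

```latex
\begin{align*}
\text{parent-offspring:} \quad & (k_0, k_1, k_2) = (0, 1, 0)
  & \phi &= 0.5(0) + 0.25(1) = 0.25 \\
\text{full siblings:} \quad & (k_0, k_1, k_2) = (0.25, 0.5, 0.25)
  & \phi &= 0.5(0.25) + 0.25(0.5) = 0.25 \\
\text{first cousins:} \quad & (k_0, k_1, k_2) = (0.75, 0.25, 0)
  & \phi &= 0.5(0) + 0.25(0.25) = 0.0625 \\
\text{unrelated:} \quad & (k_0, k_1, k_2) = (1, 0, 0)
  & \phi &= 0
\end{align*}
```

Parent-offspring and full-sibling pairs have the same *ϕ* but different *k*-coefficients, so the three *k*-coefficients carry more information about a relationship than *ϕ* alone.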

Assume that individuals are from a non-inbred population in Hardy-Weinberg Equilibrium (HWE) and that each marker locus is segregating independently. For given IBD modes, the conditional probabilities of the seven possible identity by state (IBS) modes *S*_{i} are shown in Table I. For a single locus with observed IBS mode *S*_{i}, the likelihood of the *k*-coefficients is [Thompson, 1975]:

$L(k_0, k_1, k_2) = P(S_i \mid k_0, k_1, k_2) = \sum_{j=0}^{2} k_j \, P(S_i \mid X = j).$

Assuming independent unlinked loci, the likelihood for the overall genome is then simply the product of the single locus likelihoods. We used the EM algorithm [Dempster et al., 1977] to find maximum-likelihood estimators for the *k*-coefficients [McPeek and Sun, 2000] and then obtained an estimate of the kinship coefficient, $\hat{\phi} = 0.5\,\hat{k}_2 + 0.25\,\hat{k}_1$. The EM algorithm provides more efficient computation than the simplex method, a hill-climbing optimization technique, which has been used previously [Milligan, 2003].
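The estimation step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (which was written in R): the function names are hypothetical, genotype probabilities are written for an ordered pair of individuals under HWE, and allele frequencies are treated as known. The EM update is the standard one for mixture weights with known component probabilities.

```python
def genotype_prob(g, freq):
    """HWE probability of an unordered genotype g = (a, b)."""
    a, b = g
    return freq[a] ** 2 if a == b else 2.0 * freq[a] * freq[b]


def partner_allele_prob(g, s, freq):
    """Frequency of the non-shared allele of g, given that g carries the IBD allele s."""
    a, b = g
    if s == a:
        return freq[b]
    if s == b:
        return freq[a]
    return 0.0


def ibd_conditional_probs(g1, g2, freq):
    """Return [P(g1, g2 | X=0), P(g1, g2 | X=1), P(g1, g2 | X=2)] at one locus."""
    g1, g2 = tuple(sorted(g1)), tuple(sorted(g2))
    p0 = genotype_prob(g1, freq) * genotype_prob(g2, freq)
    # One allele IBD: sum over the possible shared allele s.
    p1 = sum(freq[s] * partner_allele_prob(g1, s, freq) * partner_allele_prob(g2, s, freq)
             for s in set(g1) & set(g2))
    p2 = genotype_prob(g1, freq) if g1 == g2 else 0.0
    return [p0, p1, p2]


def em_k_coefficients(cond, n_iter=1000, tol=1e-12):
    """EM for (k0, k1, k2); cond[l][j] = P(genotype pair at locus l | X = j)."""
    k = [1.0 / 3.0] * 3
    for _ in range(n_iter):
        totals = [0.0, 0.0, 0.0]
        for row in cond:
            joint = [k[j] * row[j] for j in range(3)]
            norm = sum(joint)
            for j in range(3):
                totals[j] += joint[j] / norm      # E-step: posterior IBD probabilities
        k_new = [t / len(cond) for t in totals]   # M-step: average the posteriors
        converged = max(abs(k_new[j] - k[j]) for j in range(3)) < tol
        k = k_new
        if converged:
            break
    return k


def kinship_from_k(k):
    """Kinship coefficient phi = 0.5 * k2 + 0.25 * k1."""
    return 0.5 * k[2] + 0.25 * k[1]


# Toy usage: three SNP loci for one pair (genotypes and frequencies are made up).
freq = {"A": 0.4, "B": 0.6}
pair = [(("A", "A"), ("A", "B")), (("A", "B"), ("A", "B")), (("B", "B"), ("A", "B"))]
cond = [ibd_conditional_probs(g1, g2, freq) for g1, g2 in pair]
k_hat = em_k_coefficients(cond)
print(k_hat, kinship_from_k(k_hat))
```

Because every locus contributes one row of three conditional probabilities, the cost per pair is linear in the number of markers, and the estimates stay inside the simplex without explicit constraints.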

Relationship estimation can be extended to inbred populations by estimating probabilities of Jacquard’s nine IBD modes instead of three *k*-coefficients [Jacquard, 1972; Anderson and Weir, 2007; Weir et al., 2006]. The inbreeding coefficient, *f*, which is the probability that a person carries two alleles IBD at a locus, can also be estimated by maximizing the likelihood using the EM algorithm [Dempster et al., 1977; Thompson, 2000]. In human genetic isolated populations, it is rare to have inbreeding within an individual even though there are higher kinships among individuals [Agarwala et al., 2001].

### THE CORRECTION OF TYPE 1 ERROR FOR ALLELIC ASSOCIATION TESTING

A corrected *χ*^{2} test has previously been proposed to correct for inflated false positive rates for association tests in the context of known relationships [Bourgain et al., 2003]. We begin by outlining this test for the diallelic case. Suppose we have *N*_{c} sampled individuals in a case group and *N*_{t} sampled individuals in a control group. Let **Y** = (*Y*_{1}, …, *Y*_{i}, …, *Y*_{N})^{T}, where *Y*_{i} = 0.5 × (the number of alleles of type 1 in individual *i*) and *N* = *N*_{c} + *N*_{t}. Let *p* be the frequency of allele 1, 0 < *p* < 1. Under the null hypothesis of no association between a given marker and disease and HWE for the given marker, *E*_{0}(**Y**) = *p* **1**_{N} and ${\mathit{Var}}_{0}\left(\mathbf{Y}\right)=\frac{1}{2}p\left(1-p\right)\mathbf{\Phi}$, where

$\mathbf{\Phi} = \begin{pmatrix} 1+f_1 & 2\phi_{12} & \cdots & 2\phi_{1N} \\ 2\phi_{21} & 1+f_2 & \cdots & 2\phi_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ 2\phi_{N1} & 2\phi_{N2} & \cdots & 1+f_N \end{pmatrix},$

*f*_{i} is the inbreeding coefficient of individual *i*, and *ϕ*_{ij} is the kinship coefficient between two individuals *i* and *j*, 1 ≤ *i*, *j* ≤ *N*. Here *f*_{i} = 0 if we assume a non-inbred population.
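The null variance structure can be motivated by a short derivation. The sketch below assumes that each allele is of type 1 with probability *p* and that two alleles that are IBD are of the same type; the allele indicators *A* are notation introduced here for illustration only.

```latex
\begin{align*}
Y_i &= \tfrac{1}{2}\,(A_{i1} + A_{i2}), \qquad A_{im} \in \{0, 1\},\quad P(A_{im} = 1) = p, \\
\mathrm{Var}_0(Y_i) &= \tfrac{1}{4}\,\bigl[\,2p(1-p) + 2 f_i\, p(1-p)\,\bigr]
  = \tfrac{1}{2}\, p(1-p)\,(1 + f_i), \\
\mathrm{Cov}_0(Y_i, Y_j) &= \tfrac{1}{4} \sum_{m=1}^{2} \sum_{n=1}^{2} \mathrm{Cov}(A_{im}, A_{jn})
  = \tfrac{1}{4} \cdot 4\, \phi_{ij}\, p(1-p)
  = \tfrac{1}{2}\, p(1-p)\,(2\phi_{ij}).
\end{align*}
```

Under these assumptions the matrix multiplying ½ *p*(1 − *p*) has diagonal entries 1 + *f*_{i} and off-diagonal entries 2*ϕ*_{ij}, and it reduces to the identity matrix for unrelated, non-inbred individuals.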

To test for an association between the marker alleles and the disease, we consider the model: *E*(**Y**) = **μ** = (*μ*_{1}, …, *μ*_{N})^{T} with *μ*_{i} = *p* + *r* if *i* is a case and *μ*_{i} = *p* if *i* is a control. The null hypothesis is *H*_{0}: *r* = 0. The proposed corrected *χ*^{2} test is one that extends the *χ*^{2} test by taking into account the correlation structure among individuals. The test statistic of the corrected *χ*^{2} test is [Bourgain et al., 2003]:

${W}_{{\chi}_{\mathit{corr}}^{2}}=\frac{{N}_{c}^{2}{\left({\overline{Y}}_{c}-\overline{Y}\right)}^{2}}{\frac{1}{2}\overline{Y}\left(1-\overline{Y}\right){\left({\mathbf{1}}_{c}-\frac{{N}_{c}}{N}{\mathbf{1}}_{N}\right)}^{T}\mathbf{\Phi}\left({\mathbf{1}}_{c}-\frac{{N}_{c}}{N}{\mathbf{1}}_{N}\right)}$ (5)

where
${\overline{Y}}_{c}=\frac{1}{{N}_{c}}{\sum}_{i\in \mathit{cases}}{Y}_{i},\phantom{\rule{0.2em}{0ex}}\overline{Y}=\frac{1}{N}{\sum}_{i=1}^{N}{Y}_{i},\phantom{\rule{0.2em}{0ex}}{\mathbf{1}}_{N}={\left(1,\cdots ,1\right)}^{T}$, and **1**_{c} is a vector of length *N* with *i*th entry 1 if individual *i* is a case and 0 if *i* is a control. The statistic
${W}_{{\chi}_{\mathit{corr}}^{2}}$ follows a *χ*^{2} distribution with 1 degree of freedom asymptotically under the null hypothesis [Thornton and McPeek, 2007]. When all kinship and inbreeding coefficients are zero, the test statistic in (5) is the *χ*^{2} test statistic,

${W}_{{\chi}^{2}}=\frac{2N{N}_{c}{\left({\overline{Y}}_{c}-\overline{Y}\right)}^{2}}{{N}_{t}\overline{Y}\left(1-\overline{Y}\right)}.$

The corrected *χ*^{2} test can be easily extended to the multiallelic case and under the null distribution, the test statistic asymptotically follows a *χ*^{2} distribution with (*a* − 1) degrees of freedom, where *a* is the number of alleles, as described previously [Bourgain et al., 2003].
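For the diallelic case, the correction can be illustrated with a small Python sketch. It uses a score-type form that is consistent with the null mean and variance stated above: the squared difference between the case allele total and its null expectation, divided by its null variance with *p* replaced by the sample mean of *Y*. Treat this as an assumption about the form of the statistic rather than the authors' exact expression; the function names are hypothetical.

```python
import math


def build_phi(kinship, inbreeding=None):
    """Matrix with 1 + f_i on the diagonal and 2 * phi_ij off the diagonal."""
    n = len(kinship)
    f = inbreeding if inbreeding is not None else [0.0] * n
    return [[1.0 + f[i] if i == j else 2.0 * kinship[i][j] for j in range(n)]
            for i in range(n)]


def corrected_chi2(y, is_case, phi):
    """Return (statistic, p-value) for the 1-df corrected allelic test.

    y[i] is 0.5 * (number of type-1 alleles in individual i).
    """
    n = len(y)
    n_c = sum(1 for c in is_case if c)
    y_bar = sum(y) / n
    # v = 1_c - (N_c / N) 1_N, so that v'Y = N_c * (Ybar_c - Ybar).
    v = [(1.0 if c else 0.0) - n_c / n for c in is_case]
    score = sum(vi * yi for vi, yi in zip(v, y))
    quad = sum(v[i] * phi[i][j] * v[j] for i in range(n) for j in range(n))
    stat = score ** 2 / (0.5 * y_bar * (1.0 - y_bar) * quad)
    # Upper tail of a chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy usage: two cases, two controls (values are made up).
y = [1.0, 0.5, 0.0, 0.5]
is_case = [True, True, False, False]
no_kin = [[0.0] * 4 for _ in range(4)]
kin = [row[:] for row in no_kin]
kin[0][1] = kin[1][0] = 0.25        # the two cases are parent and offspring
print(corrected_chi2(y, is_case, build_phi(no_kin)))
print(corrected_chi2(y, is_case, build_phi(kin)))
```

With all kinship coefficients set to zero the statistic equals the Pearson *χ*^{2} statistic on the 2 × 2 table of allele counts; positive kinship among cases inflates the quadratic form in the denominator and so shrinks the statistic.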

The corrected *χ*^{2} test requires known kinship coefficients and inbreeding coefficients. We propose replacing these coefficients with their MLEs. We call this statistic a corrected *χ*^{2} test with estimated coefficients, ${W}_{{\chi}_{\mathit{ecorr}}^{2}}$; it is the same as ${W}_{{\chi}_{\mathit{corr}}^{2}}$ except that we use $\hat{\mathbf{\Phi}}$, computed from the estimated coefficients, rather than the assigned **Φ** from a known pedigree structure. With estimates of kinship and inbreeding coefficients, unknown background relatedness or cryptic relatedness caused by common population history can be accommodated. An estimate of kinship would also be useful when a sample includes a mixture of known and unknown relationships or any missing or misspecified pedigree relationships.

### EVALUATION OF THE PROPOSED METHOD

To investigate the performance of kinship estimation, we estimated kinship coefficients based on both simulated data and CEPH family data. The simulated data was used to evaluate performance for use with multiallelic markers and with varying available relationship information, and the CEPH data was used to evaluate performance for use with denser SNP markers. We also performed a simulation study to explore the null distribution of *p*-values and to compare the power of the corrected *χ*^{2} tests using different types of kinship coefficients with the classical *χ*^{2} test. Four kinship coefficients were considered: actual (*ϕ*_{act}), pedigree-based (*ϕ*_{ped}), estimated (*ϕ*_{est}), and posterior (*ϕ*_{post}) kinship coefficients. The actual kinship coefficient, *ϕ*_{act}, is computed based on the underlying truth in the simulation and the pedigree-based kinship coefficient, *ϕ*_{ped}, is assigned from a known pedigree structure. The estimated kinship coefficient, *ϕ*_{est}, is the maximum-likelihood estimate computed based on genome data only, without pedigree information. The posterior kinship coefficient, *ϕ*_{post}, is also an estimated coefficient, but based on both genome data and pedigree information. In all cases, we used the average of posterior kinship coefficients over the whole genome panel of markers. Merlin 1.1.2 was used to compute the posterior IBD and kinship coefficients [Abecasis et al., 2002].

We focused only on individuals and families from non-inbred populations, which means that inbreeding coefficients were assumed zero. The program **genedrop** in Morgan 2.8.2 [Thompson, 1995; Thompson, 2000b; Thompson, 2005] was used for generating marker genotypes for the simulated data sets. For each individual, we simulated 400 microsatellite markers based on Rutgers map positions and allele frequencies from the version 10 CEPH data at ftp://ftp.cephb.fr/ceph_genotype_db/. These markers were chosen randomly, subject to the condition that they were separated by an average of 10 cM. Except where otherwise noted, we used all 400 markers for further analyses.

The simulation studies were performed based on three relationship scenarios. (1) In Scenario I, the case group consisted of 250 individuals from 50 pedigrees as seen in Figure 1 with each pedigree having five cases. The entire pedigree information was assumed known for the case group. The control group simply included 250 independent and unrelated individuals. (2) Scenario II was the same as Scenario I except the cases were chosen from a large pedigree of 13 generations and 4070 individuals to simulate the background correlation among individuals. Each set of parents in the 13 generations had one or two offspring, and 1124 individuals were unrelated founders. We assumed that the genealogy of only the last two generations was available. (3) In Scenario III, samples were simulated for 500 related and 500 independent unrelated people. The related individuals were in the final generations of a large pedigree of 13 generations as in Scenario II, but the pedigree information was assumed known for the last three generations. Controls in each of these three scenarios were unrelated, which is the most ideal situation. Cases of Scenario I were from an entirely known pedigree without any background correlation. Scenario II and III simulated background correlation among cases with limited pedigree information, which commonly occurs in genetic studies. Scenario II had less pedigree information than Scenario III, so we could explore more efficiently the effect of lack of pedigree information on *ϕ*_{ped}, *ϕ*_{post} and *ϕ*_{est}.

#### PERFORMANCE OF KINSHIP ESTIMATION

To evaluate the performance of kinship estimation, we carried out a simulation study based on Scenario I. Multiallelic marker simulation was done without reference to the fixed disease status, thus representing genotypes generated under *H*_{0}. The relationships in this sample are 150 parent-offspring (PO), 50 avuncular (AV), 100 grandparent-grandchild (GG) and 50 first cousin (CO) pairs. Also, 124,400 unrelated pairs (UN) were included in the sample. For each individual, we simulated 400 microsatellite markers as described above and among these simulated markers, subsets of 50, 100, 200 and 400 markers were used to calculate MLEs of kinship coefficients using the EM algorithm. Allele frequencies were assumed unknown and sample allele frequencies were used for such estimation.
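The pair counts quoted above can be checked with a few lines of arithmetic. The sketch assumes the Scenario I sample of 250 cases and 250 controls and takes the related-pair counts directly from the text; the variable names are illustrative.

```python
# Scenario I sample size: 250 cases plus 250 unrelated controls.
n_individuals = 250 + 250

# Related pairs reported for the sample.
related_pairs = {"PO": 150, "AV": 50, "GG": 100, "CO": 50}

total_pairs = n_individuals * (n_individuals - 1) // 2
unrelated_pairs = total_pairs - sum(related_pairs.values())

print(total_pairs)       # all distinct pairs among 500 individuals
print(unrelated_pairs)   # pairs left after removing the related ones
```

The 500 individuals form 124,750 distinct pairs, and removing the 350 related pairs leaves the 124,400 unrelated pairs reported in the text.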

We also investigated the performance of kinship estimation based on diallelic markers by estimating kinship coefficients of CEPH families based on the real data, single-nucleotide polymorphism (SNP) markers from the version 10 CEPH data at ftp://ftp.cephb.fr/ceph_genotype_db/. We selected 91 individuals of 12 pedigrees with more than 80% genotyped markers. To investigate the effect of use of different numbers of markers, three marker panels were used: 500 SNP markers, 5,000 SNP markers, and all of the 16,977 SNP markers of the whole genome. In the first two panels, the markers were randomly chosen from the 16,977 SNP markers. Kinship coefficients were estimated for all possible pairs of 91 individuals based on these two marker panels. The entire pedigree information was known and estimated kinship coefficients were compared with pedigree-based kinship coefficients.

#### NULL DISTRIBUTION OF THE P-VALUES OF THE CORRECTED χ^{2} TEST

In Scenario I, we evaluated whether resulting *p*-values were distributed uniformly as would be expected under the null distribution. The *χ*^{2} test and the corrected *χ*^{2} test with *ϕ*_{act}, *ϕ*_{ped}, *ϕ*_{est}, and *ϕ*_{post} were performed for each multiallelic marker and *Q – Q* plots were used to examine the uniformity of the *p*-value distributions. Note that the same markers are used for both the estimation of kinship coefficients in the previous section and for the test of association.
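A uniformity check of this kind can be sketched as follows. The code pairs the sorted observed *p*-values with uniform plotting positions; the choice of (*i* − 0.5)/*n* as plotting positions and the function names are assumptions made here for illustration, not details taken from the paper.

```python
import math


def uniform_qq_points(p_values):
    """Pair each sorted p-value with its expected uniform quantile."""
    observed = sorted(p_values)
    n = len(observed)
    expected = [(i + 0.5) / n for i in range(n)]
    return list(zip(expected, observed))


def to_neg_log10(points):
    """Optional transform that stretches the small-p tail of the plot."""
    return [(-math.log10(e), -math.log10(o)) for e, o in points]


# Toy usage with made-up p-values; points near the diagonal suggest uniformity.
points = uniform_qq_points([0.5, 0.1, 0.9])
for expected, observed in points:
    print(round(expected, 3), observed)
```

Observed quantiles that fall below the diagonal at the small-*p* end indicate an excess of small *p*-values, that is, an inflated false positive rate.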

#### ASSESSMENT OF THE EFFECT OF MISSING PEDIGREE INFORMATION

We further investigated the null distribution of *p*-values for Scenario II, where only part of the genealogy was known: *ϕ*_{ped} and *ϕ*_{post} were computed based on the pedigree information and the pedigree plus multiallelic marker information, respectively, of the last two generations. Under the null distribution, we performed the classical *χ*^{2} test and the corrected *χ*^{2} test with *ϕ*_{act}, *ϕ*_{ped}, *ϕ*_{est}, and *ϕ*_{post} for each simulated marker. *Q – Q* plots were used to present the null distribution of *p*-values.

To assess the performance of various kinship coefficients based on limited pedigree information, *ϕ*_{ped} and *ϕ*_{est} were compared with *ϕ*_{act} in Scenario III. The genealogy of only the last three generations was used for the computation of *ϕ*_{ped} and *ϕ*_{post} to mimic a typical situation in which more distant relationships are not known.

#### POWER OF THE χ^{2} AND CORRECTED χ^{2} TESTS

For comparing the power of the classical *χ*^{2} test and the corrected *χ*^{2} test with various kinship coefficients, we performed simulations based on Scenario III. The genotype of the trait locus was simulated under a disease allele frequency of 0.5, with affected status assigned based on the simulated genotypes at the trait locus and the penetrance probabilities shown in Table III. This yielded a variable number of cases and controls over replicates. The mean sizes of the case and control samples were reported with their standard deviations in Table III. The power of the *χ*^{2} test and the corrected *χ*^{2} test were estimated from 10,000 replicates.

### APPLICATION TO REAL DATA

The method was applied to two real data sets, one from Guam and the other from Kosrae. Both studies were approved by the University of Washington Institutional Review Board (IRB). The data set from Guam consisted of a genome scan for association in a case-control sample collected to complement a family-based linkage study. The disorder of the cases has characteristics of both amyotrophic lateral sclerosis and parkinsonism combined with dementia and is prevalent in the Chamorros, the indigenous people of Guam. A sample of 140 cases and 88 age-matched Chamorro controls was available, with no known relationships among these subjects at the time they were sampled. The markers consisted of a standard genome-scan panel of 402 multiallelic markers. For each pair of individuals, the kinship and inbreeding coefficients were estimated using the methods described above. For each marker, allele frequencies of cases and controls were compared using the standard and the corrected *χ*^{2} tests.

The data set from Kosrae consisted of a genome scan for schizophrenia [Wijsman et al., 2003]. A sample of 36 cases and 76 unrelated controls from the island was available with the genealogy known for the case group. Even though the pedigree information of the cases was available, we suspected there might be additional relatedness among cases and cryptic relatedness among controls since the data were collected from a relatively small population. The genome scan consisted of a standard panel of 379 microsatellite markers. As in the analysis of the Guam data, estimation of relatedness and association tests were performed. Also, using the known genealogy, the corrected *χ*^{2} test with pedigree-based assigned kinship coefficients was performed.

## RESULTS

### EVALUATION OF THE PROPOSED METHOD

#### PERFORMANCE OF KINSHIP ESTIMATION

With increasing number of markers used for estimation, *ϕ*_{est} was closer to *ϕ*_{ped}, with also less variation in the estimates from multiallelic markers. The values of the pedigree-based *k*-coefficients and kinship coefficients for non-inbred individuals are shown in Table II. Figure 2 shows the relation between *ϕ*_{est} and *ϕ*_{ped} using 50 and 400 multiallelic markers (Panel A). For most relationships, *ϕ*_{est} was close to the pedigree-based expectation, *ϕ*_{ped}, with the accuracy of the estimate increasing with the number of markers. For example, in the case of PO pairs, the mean of *ϕ*_{est} was 0.255 (s.d. 0.009), 0.254 (0.006), 0.253 (0.005) and 0.252 (0.003) when 50, 100, 200 and 400 multiallelic markers were used, respectively. Estimation of other relationships showed a similar trend (not shown).

Similarly, in the case of diallelic markers, *ϕ*_{est} was closer to *ϕ*_{ped} when more markers were used. Figure 2 shows the relation between *ϕ*_{est} and *ϕ*_{ped} in the diallelic case of CEPH families with 500, 5,000 and 16,977 SNP markers (Panel B). Panel B also shows that kinship estimates from 5,000 and 16,977 markers were similar, which indicates that the number of SNPs does not need to be too large. The estimate of the kinship coefficient for parent-offspring was more accurate than that for full-siblings in the CEPH data. In the case of the 5,000 diallelic marker panel, the means of the estimated kinship coefficients were 0.243 (0.026, N=8) for full-sibling and 0.243 (0.008, N=87) for parent-offspring. The means of the parent-offspring and full-sibling relationships were similar, but that for the parent-offspring relationship had a smaller standard deviation even with the larger sample size. However, kinship coefficients were overestimated for unrelated pairs for both the simulated and CEPH data.

The EM algorithm was used for estimating relatedness with computation time proportional to *MN*^{2}, where *M* is the number of markers and *N* is the number of individuals. We implemented the method by using R-2.5.0. In the case of the CEPH data (91 individuals), computation times on a Linux AMD Opteron 1.8GHz computer were 804.41 sec for 500 markers, 1617.68 sec for 1,000 markers and 7712.80 sec (2.14 hrs) for 5,000 markers.
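The reported timings can be checked against the stated scaling with simple arithmetic. The sketch below only rearranges the numbers given in the text (91 individuals and the three run times); the variable names are illustrative.

```python
# Reported run times in seconds, keyed by number of markers M.
timings = {500: 804.41, 1000: 1617.68, 5000: 7712.80}

# Number of distinct pairs among the 91 CEPH individuals (order N^2).
n_pairs = 91 * 90 // 2

# If time grows linearly in M for fixed N, seconds per marker should be
# roughly constant across the three panels.
sec_per_marker = {m: sec / m for m, sec in timings.items()}
hours_5000 = timings[5000] / 3600

for m, rate in sec_per_marker.items():
    print(m, round(rate, 3), "sec per marker")
print(n_pairs, "pairs;", round(hours_5000, 2), "hours for 5,000 markers")
```

The per-marker cost stays between about 1.5 and 1.6 seconds across the three panels, which is consistent with run time growing linearly in *M* for a fixed set of 4,095 pairs.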

#### NULL DISTRIBUTION OF THE P-VALUES OF THE CORRECTED χ^{2} TEST

The corrected *χ*^{2} test using kinship coefficients reduced the false positive rate substantially. The *Q – Q* plots of *p*-values of the four association tests for 50 and 400 markers are shown in Figure 3. Results for 100 and 200 multiallelic markers were consistent with these (not shown). Since genotypes of all markers were simulated under the null hypothesis of no association, *p*-values are expected to be uniformly distributed. Figure 3A suggests a high false positive rate when the classical *χ*^{2} test was used. Panels B, C and D of Figure 3 demonstrate that the corrected *χ*^{2} test reduces the false positive rate, producing a distribution of *p*-values that is close to uniform. Since the entire genealogy was known, *ϕ*_{ped} was close to *ϕ*_{act} and the *p*-values were uniformly distributed (Figure 3D). The *ϕ*_{post} had the same results as the pedigree-based results (not shown). When *ϕ*_{est} was used, the *Q – Q* plots of *p*-values were slightly different from the uniform distribution (Figure 3C). Overestimated kinship coefficients for UN pairs, especially when one individual is from the case group and the other is from the control group, may lead to slightly inflated false positive rates.

#### ASSESSMENT OF THE EFFECT OF MISSING PEDIGREE INFORMATION

With missing pedigree information, the corrected *χ*^{2} test using *ϕ*_{est} reduces the rate of spurious positive results more than when using *ϕ*_{ped} or *ϕ*_{post}. Figure 4 presents the *Q – Q* plots of *p*-values of the *χ*^{2} and the corrected *χ*^{2} tests when only part of the genealogy is known. As with the previous simulation, the *χ*^{2} test had a high false positive rate, and the *p*-values of the corrected *χ*^{2} test with *ϕ*_{act} were uniformly distributed (Figure 4). However, missing pedigree information resulted in undercorrection of false positives when *ϕ*_{ped} was used. In the case of *ϕ*_{est}, the *Q – Q* plot was consistent with the previous simulation (with the entire genealogy) since the performance of kinship estimation was not affected by missing pedigree information.

Figure 4. *Q – Q* plots of *p*-values for the *χ*^{2} test and corrected *χ*^{2} tests in the simulated sample of Scenario II.

With the incomplete genealogy, *ϕ*_{est} gave more precise estimates for related pairs, but *ϕ*_{ped} and *ϕ*_{post} characterized unrelated pairs better because of the intrinsic overestimation of kinship coefficients in this case (Figure 5). Because MLEs of the kinship coefficient of UN pairs were overestimated, this approach may not distinguish unrelated people from people sharing a common genetic background. On the other hand, based on only the part of the pedigree information used in the analysis, *ϕ*_{ped} and *ϕ*_{post} were 0 for some of the pairs that shared their unknown genetic background. This leads to an under-correction of relatedness and an inflated false positive rate. For related pairs, the mean squared error (MSE) of *ϕ*_{est}, *ϕ*_{ped} and *ϕ*_{post} was 1.1 × 10^{−4}, 3.3 × 10^{−4} and 3.2 × 10^{−4}, respectively. For unrelated pairs, the MSE of *ϕ*_{est} was 1.36 × 10^{−4} and the MSEs of *ϕ*_{ped} and *ϕ*_{post} were 0.

#### POWER OF THE CORRECTED χ^{2} TEST USING ESTIMATED AND ACTUAL KINSHIPS

Overall, the *χ*^{2} test is most powerful at the nominal significance level, as expected, given the inflated false positive rate. Table IV shows the estimated power for the *χ*^{2} test and the corrected *χ*^{2} tests with *ϕ*_{act}, *ϕ*_{ped}, *ϕ*_{est} and *ϕ*_{post}. At the nominal significance level, the corrected *χ*^{2} test has lower power than the *χ*^{2} test, and in particular, the corrected *χ*^{2} test with *ϕ*_{act} has the least power. The corrected *χ*^{2} test with *ϕ*_{est} is more powerful than the corrected *χ*^{2} test with *ϕ*_{ped} and *ϕ*_{post}. This result shows that the correction for kinship may cause apparent loss of power given a nominal significance level. However, the corrected tests give the proper null distribution of *p*-values as we showed previously. As a result, the test has the correct type I error in contrast to the *χ*^{2} test, which is too liberal. Because power in our simulation was estimated under the nominal significance level, but the tests have different type I errors, it is not meaningful to compare actual power in this situation.
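The distinction between nominal and actual error rates can be made concrete with a small sketch. The same rejection-rate calculation estimates the type I error when replicates are simulated under the null and the power when they are simulated under an alternative. The function name and the toy *p*-values are hypothetical.

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of replicates with p-value at or below the nominal level."""
    return sum(1 for p in p_values if p <= alpha) / len(p_values)


# Toy usage with made-up replicate p-values for one test.
null_replicates = [0.01, 0.20, 0.04, 0.50]          # simulated under H0
alternative_replicates = [0.001, 0.03, 0.20, 0.02]  # simulated under H1

empirical_type1 = rejection_rate(null_replicates)
empirical_power = rejection_rate(alternative_replicates)
print(empirical_type1, empirical_power)
```

A test whose empirical type I error exceeds the nominal level will also show inflated apparent power, which is why rejection rates at a common nominal level are not directly comparable between a liberal test and a calibrated one.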

### APPLICATION TO REAL DATA

#### ESTIMATION OF CRYPTIC RELATIONSHIPS

Estimated kinship coefficients suggested that there were unknown relationships in the Guam and Kosrae data. Even though some pedigree information in the Kosrae data was known, more relationships were found by estimating kinship coefficients. Panels A and B of Figure 6 present the estimated *k*-coefficients and Panels C and D of Figure 6 show the cumulative distribution of the estimated kinship coefficients in the two samples. Estimation of *k*-coefficients allows us to quantify not only well-defined and possibly verifiable relatives such as parent-offspring, full sibling or cousin pairs, but also continuous degrees of relatedness. In the Guam data, the means and standard deviations of the estimated kinship coefficients are 0.023 (0.016) for the case group and 0.021 (0.020) for the control group, with the case group having slightly higher average estimated kinship coefficients. This is consistent with the expectation that affected individuals may be more closely related because they share an ancestral mutation leading to the genetic disorder [Voight and Pritchard, 2005]. In the Kosrae data, the mean estimated kinship coefficient in the case group is 0.026 (0.025), which is also slightly higher than 0.024 (0.017) in the control group.

#### TESTING ASSOCIATIONS

The corrected *χ*^{2} test based on *ϕ _{est}* successfully reduced spurious associations between markers and disease. *Q–Q* plots of the *p*-values are shown in Panels E and F of Figure 6. In the Guam data analysis, *p*-values of the uncorrected *χ*^{2} tests are not uniformly distributed, but *p*-values of the corrected *χ*^{2} tests based on *ϕ _{est}* are close to uniformly distributed. These results show that the false positive rate was reduced by correcting for relatedness. If the distribution of the *p*-values were exactly uniform, the expected 1% and 5% quantiles would be 0.01 and 0.05, but the 1% and 5% quantiles of the *χ*^{2} test are 0.0007 and 0.007, which are considerably smaller than the nominal quantiles. For the corrected *χ*^{2} test, the 1% and 5% quantiles are less extreme at 0.02 and 0.098, and are actually conservative. When the uncorrected *χ*^{2} test was used, 95 markers were significantly associated with the disease at significance level 0.05, but only 10 markers remained at this significance level with the corrected *χ*^{2} test.
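The quantile comparison described here can be carried out for any vector of *p*-values. The following is a minimal Python sketch, not the code used in the study; the function names `empirical_quantile` and `count_significant` and the toy input are illustrative assumptions.

```python
import math


def empirical_quantile(pvalues, prob):
    """Return the empirical quantile of `pvalues` at probability `prob`.

    Uses a simple order-statistic definition: the smallest observed value
    such that at least a fraction `prob` of the values are <= it.
    """
    if not pvalues:
        raise ValueError("pvalues must be non-empty")
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    ordered = sorted(pvalues)
    index = max(math.ceil(prob * len(ordered)) - 1, 0)
    return ordered[index]


def count_significant(pvalues, alpha=0.05):
    """Count tests whose p-value is strictly below the significance level."""
    return sum(1 for p in pvalues if p < alpha)


# Toy input (an assumption, not study data): an evenly spaced grid that
# mimics perfectly uniform p-values.
uniform_like = [(i + 0.5) / 1000 for i in range(1000)]
q01 = empirical_quantile(uniform_like, 0.01)
q05 = empirical_quantile(uniform_like, 0.05)
n_sig = count_significant(uniform_like, alpha=0.05)
```

Empirical quantiles well below 0.01 and 0.05 point to a liberal test, while quantiles above those values point to a conservative one.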

Similar to the Guam data analysis, a *Q–Q* plot of *p*-values in the Kosrae data analysis reveals that the *χ*^{2} test results in a high false positive rate, but these false associations diminished in two corrected *χ*^{2} tests (Figure 6F). Since, in the Kosrae study, the pedigree information of the case group was available, the corrected *χ*^{2} tests were performed with *ϕ _{ped}*. However, since the known genealogy does not include complete information of relatedness among individuals, the corrected *χ*^{2} tests work better with *ϕ _{est}* than with *ϕ _{ped}*. The 1% and 5% quantiles of the *χ*^{2} test are 0.0014 and 0.0095, and the 1% and 5% quantiles of the corrected *χ*^{2} test with *ϕ _{est}* are 0.0095 and 0.036, much closer to their nominal levels. We had 50 significant markers at significance level 0.05 with the *χ*^{2} test, but 22 of these markers were not significant after correcting relatedness by using *ϕ _{est}*.

After correction of relatedness among individuals, the *p*-values increased, which implies less significant associations between disease and markers. The ten markers with the smallest *p*-values are reported in Table V for the Guam study and Table VI for the Kosrae study. For the Guam study, the correction for cryptic relatedness gave a p-value that was 1-3 orders of magnitude higher than that from the uncorrected test, while for the Kosrae study, the p-values for all these extreme markers increased by one order of magnitude.

## DISCUSSION

We have presented here an evaluation of the usefulness of estimated kinship coefficients to account for cryptic relatedness in a population-based association study. Such cryptic relatedness may occur even when there is no population structure and when pedigree information is available. For example, the pedigrees may not fully describe the actual relatedness among individuals when individuals share a long history of common genetic background, or in the presence of non-paternity or unreported adoption. We showed that the use of whole-genome scan markers to estimate relationships and to determine an appropriate variance correction reduces the false positive rate that otherwise results from use of the data without such a correction.

While part of our investigation involved simulation, simulation cannot evaluate every possible situation. In particular, in isolated populations, there may be variable levels of inbreeding, common genetic background from shared complex and deep genealogies, or differences between social kinship and biological kinship. Such isolated populations motivated us to explore Scenarios II and III, which simulated the relatively simple situation in which only partial information about relatedness among samples is available. Real situations probably involve much deeper genealogies, but even these simple situations demonstrated the impact of the missing relationship information.

To properly reduce spurious positive evidence of association, it is important to get accurate estimates of relationships. Our results indicate that estimated kinship coefficients can be more accurate than pedigree-based or posterior kinship coefficients. The reasons for this improved accuracy are that pedigree-based kinship coefficients allow no variability within a class of relative pair, and posterior kinship coefficients do not allow for sources of relationship other than the pedigree. The one exception is the case of unrelated pairs, for which MLEs can be biased because only non-negative values are allowed for estimated kinship coefficients, thereby resulting in overcorrection of the test statistic. An ad hoc method might be considered to avoid overcorrecting for such unrelated pairs, such as setting kinship coefficients to 0 for pairs with estimated kinship coefficients smaller than a specific threshold. This might reduce the overcorrection of inflated false positive rates. Another concern in estimating relatedness is potential violation of the Hardy-Weinberg equilibrium (HWE) assumption, which might affect the resulting MLEs if there is Hardy-Weinberg disequilibrium (HWD) in the sample. We have not formally evaluated the effect of HWD on the resulting estimates. Since substantial HWD is rare in unstructured populations, and levels of inbreeding (which leads to HWD) are small even in the genetic isolates that have been evaluated [Agarwala et al., 2001], it is likely that the impact of HWD is minimal.
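The ad hoc thresholding idea mentioned in this paragraph could look like the following Python sketch. It is only an illustration under stated assumptions: the function name `threshold_kinship`, the example matrix, and the cutoff of 0.02 are hypothetical, and no particular threshold was evaluated in this study.

```python
def threshold_kinship(phi, cutoff):
    """Return a copy of a square kinship matrix in which off-diagonal
    entries smaller than `cutoff` are set to 0.

    Diagonal entries (self-kinship) are left unchanged.
    """
    n = len(phi)
    result = []
    for i in range(n):
        if len(phi[i]) != n:
            raise ValueError("kinship matrix must be square")
        row = []
        for j in range(n):
            value = phi[i][j]
            if i != j and value < cutoff:
                value = 0.0
            row.append(value)
        result.append(row)
    return result


# Hypothetical estimated kinship matrix for three individuals.
phi_est = [
    [0.5, 0.26, 0.01],
    [0.26, 0.5, 0.004],
    [0.01, 0.004, 0.5],
]
# The cutoff is an arbitrary illustrative value, not a recommendation.
phi_thresholded = threshold_kinship(phi_est, cutoff=0.02)
```

In this toy example the sibling-like pair (0.26) is retained, while the two small estimates are treated as unrelated.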

We estimated relationships based on diallelic markers with accuracy comparable to that obtained from multiallelic markers. As we showed, the 5,000 and 16,977 marker panels had similar variances for kinship estimates, indicating that not all markers in the dense SNP panel are needed. A subset of markers spaced sufficiently far apart provides enough information for estimating kinship coefficients and removes the need to account for linkage disequilibrium (LD) among dense SNP markers. We can therefore consider SNP markers as the simplest case of multiallelic markers, and the analysis performed on multiallelic markers in this study is applicable and valid for SNP markers as well. Computation time for SNPs was also not exorbitant compared to that for a smaller number of multiallelic markers, even for the R code used to carry out these analyses. Computation time could be improved by carrying out some computations in C or C++.
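One simple way to obtain a subset of markers that are spaced far apart is a greedy pass along each chromosome. The Python sketch below is an assumption about how such thinning could be done, not a description of the selection used in this study; the function name `thin_markers`, the positions, and the minimum gap are hypothetical.

```python
def thin_markers(positions, min_gap):
    """Greedily select markers on one chromosome so that consecutive
    selected markers are at least `min_gap` apart.

    `positions` holds map positions in any consistent unit (for example cM).
    Returns the indices of the retained markers in map order.
    """
    if min_gap < 0:
        raise ValueError("min_gap must be non-negative")
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    kept = []
    last_position = None
    for idx in order:
        if last_position is None or positions[idx] - last_position >= min_gap:
            kept.append(idx)
            last_position = positions[idx]
    return kept


# Hypothetical marker positions (cM) on a single chromosome.
marker_positions = [0.0, 0.2, 1.1, 1.5, 2.3, 5.0]
kept_indices = thin_markers(marker_positions, min_gap=1.0)
```

The choice of minimum gap trades off the number of retained markers against residual LD between neighbors, and would need to be chosen for the population at hand.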

In the proposed method, the idea of correcting the variance using genomic data is similar to the idea behind Genomic Control (GC) [Devlin and Roeder, 1999]. Both approaches use a constant correction for tests based on all genome markers. However, GC corrects the test statistics after inflated test statistics are observed, while the proposed method uses the genome data to adjust the variance so that the test statistic is not inflated in the first place. This also explains why the proposed test extends in a straightforward way to more complicated markers such as multiallelic markers, while GC needs separate correction factors for each type of marker, defined by the number of alleles. GC suffers when used for multiallelic markers because a separate correction for each type of marker is needed, and because a large number of markers of each type is required to achieve an adequate correction [Marchini et al., 2004]. In general, multiallelic markers or haplotypes based on multiple SNPs can be considerably more informative than diallelic markers for association testing [Ott and Rabinowitz, 1997; Chapman and Wijsman, 1998]. For use with multiallelic markers or haplotype data, the easy extension and computation of the corrected *χ*^{2} test to the multiallelic situation is an advantage over GC.
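For contrast with the proposed variance correction, the after-the-fact rescaling used by GC can be sketched in a few lines of Python. This is a simplified illustration for 1-df tests on diallelic markers, not the full procedure of Devlin and Roeder; the function name `genomic_control` and the choice to floor the inflation factor at 1 are assumptions made here.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom, obtained as
# the square of the 75th percentile of the standard normal (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control(chi2_stats):
    """Estimate the inflation factor from the median 1-df chi-square
    statistic and rescale every statistic by it.

    Returns (inflation_factor, corrected_statistics). The factor is floored
    at 1 so that statistics are never inflated (an assumed convention).
    """
    if not chi2_stats:
        raise ValueError("chi2_stats must be non-empty")
    inflation = max(median(chi2_stats) / CHI2_1DF_MEDIAN, 1.0)
    return inflation, [stat / inflation for stat in chi2_stats]


# Toy statistics (an assumption) whose median is twice the null median.
toy_stats = [0.1, 2.0 * CHI2_1DF_MEDIAN, 5.0]
inflation_factor, corrected_stats = genomic_control(toy_stats)
```

The single factor is estimated from the observed statistics themselves, which is why a separate factor, and enough markers to estimate it, would be needed for each class of multiallelic marker.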

A previous study recommended use of pedigree-based (prior) kinship coefficients because of the theoretical justification for the test statistic [Thornton and McPeek, 2007]. We have not studied the asymptotic properties of the corrected *χ*^{2} test based on estimated kinships. However, as shown in the simulation study, its *p*-values are uniformly distributed under the null hypothesis, so the corrected *χ*^{2} test based on estimated kinships is useful in practice for testing associations. We believe the corrected *χ*^{2} test with estimated kinships would be more appropriate for reducing the false positive rate if there is background correlation among individuals beyond what is reflected in the known pedigree structures. Cases in particular are more likely than controls to share ancestral background and to be genetically similar to one another [Voight and Pritchard, 2005]. Estimated kinship coefficients would also be more appropriate for adjusting for an imbalance of relatedness between the case and control groups. We believe that the proposed approach will provide a practical tool for association studies to reduce false positive results caused by cryptic relatedness beyond a known or unknown genealogy.

## Acknowledgments

This study was supported by National Institutes of Health grants GM075091, AG11762, AG05136, and AG14382.

## APPENDIX

#### EM ALGORITHM FOR MLE OF K-COEFFICIENTS

Start with arbitrary initial values of $\mathbf{k}=({k}_{0}^{\left(0\right)},{k}_{1}^{\left(0\right)},{k}_{2}^{\left(0\right)})$. In the E-step, for given $({k}_{0}^{\left(k\right)},{k}_{1}^{\left(k\right)},{k}_{2}^{\left(k\right)})$ and the observed IBS *S _{i,m}* of genotypes at marker *m*, find the probability of IBD state *X _{m}* = *j*, *j* = 0, 1, 2.

The probability of genotypes of IBS *S _{i,m}* for given *X _{m}* is shown in Table I.

In the M-step, for given *Pr*(*X _{m}* = *j* | *S _{i,m}*), *m* = 1…*M*, update *k*_{0}, *k*_{1} and *k*_{2}, which maximize the complete likelihood.

The updated estimate of k-coefficients is ${\widehat{k}}_{j}^{\left(k+1\right)}=\frac{1}{M}{\sum}_{m=1}^{M}\mathit{Pr}({X}_{m}=j\mid {S}_{i,m}),\; j=0,1,2$. Then, repeat the E-step and M-step until ${\widehat{k}}_{0}$, ${\widehat{k}}_{1}$ and ${\widehat{k}}_{2}$ converge.
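The iteration can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the R code used for the analyses. The per-marker conditional probabilities of the observed genotype pair given each IBD state (the quantities tabulated in Table I) are assumed to be supplied by the caller. The E-step is taken to be the usual Bayes-rule posterior, $\mathit{Pr}(X_m=j\mid S_{i,m}) \propto k_j\,\mathit{Pr}(S_{i,m}\mid X_m=j)$, and the function names, starting values, tolerance and iteration cap are hypothetical. The conversion to a kinship coefficient uses the standard relation $\phi = k_1/4 + k_2/2$ for non-inbred pairs.

```python
def em_k_coefficients(cond_probs, init=(1 / 3, 1 / 3, 1 / 3),
                      tol=1e-10, max_iter=10000):
    """EM estimate of (k0, k1, k2) for one pair of individuals.

    `cond_probs` is a sequence with one entry per marker; each entry is a
    triple (P(data | IBD=0), P(data | IBD=1), P(data | IBD=2)).
    """
    if not cond_probs:
        raise ValueError("cond_probs must be non-empty")
    k = list(init)
    n_markers = len(cond_probs)
    for _ in range(max_iter):
        totals = [0.0, 0.0, 0.0]
        for probs in cond_probs:
            # E-step: posterior probability of each IBD state at this marker.
            joint = [k[j] * probs[j] for j in range(3)]
            denom = sum(joint)
            if denom <= 0.0:
                raise ValueError("marker has zero likelihood under current k")
            for j in range(3):
                totals[j] += joint[j] / denom
        # M-step: average the posterior probabilities over markers.
        new_k = [total / n_markers for total in totals]
        change = max(abs(new_k[j] - k[j]) for j in range(3))
        k = new_k
        if change < tol:
            break
    return tuple(k)


def kinship_from_k(k):
    """Kinship coefficient implied by k-coefficients (non-inbred pair)."""
    return k[1] / 4.0 + k[2] / 2.0


# Toy input (an assumption): half the markers are only compatible with
# IBD=0 and half only with IBD=2.
toy_probs = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
k_hat = em_k_coefficients(toy_probs)
phi_hat = kinship_from_k(k_hat)
```

Because each posterior vector sums to 1, the updated estimates stay non-negative and sum to 1 at every iteration, which is the constraint that leads to biased estimates for truly unrelated pairs.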

#### WEB RESOURCES

The R and Perl code for the proposed method will be available at http://faculty.washington.edu/wijsman/software.shtml

## References

- Abecasis GR, Cherny SS, Cookson WO, Cardon LR. Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002;30:97–101.
- Agarwala R, Schaffer AA, Tomlin JF. Towards a complete North American Anabaptist genealogy II: analysis of inbreeding. Hum Biol. 2001;73(4):533–545.
- Anderson AD, Weir BS. A maximum-likelihood method for the estimation of pairwise relatedness in structured populations. Genetics. 2007;176:421–440.
- Bacanu SA, Devlin B, Roeder K. The power of genomic control. Am J Hum Genet. 2000;66:1933–1944.
- Bourgain C, Genin E. Complex trait mapping in isolated populations: Are specific statistical methods required? Eur J Hum Genet. 2005;13:698–706.
- Bourgain C, Hoffjan S, Nicolae R, Newman D, Steiner L, Walker K, Reynoldes R, Ober C, McPeek MS. Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am J Hum Genet. 2003;72:612–626.
- Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet. 2001;2:91–99.
- Carlson CS, Eberle MA, Kruglyak L, Nickerson DA. Mapping complex disease loci in whole-genome association studies. Nature. 2004;429:446–452.
- Chapman NH, Wijsman EM. Genome screens using linkage disequilibrium tests: optimal marker characteristics and feasibility. Am J Hum Genet. 1998;63:1872–1885.
- Crow JF, Kimura M. An introduction to population genetics theory. Harper & Row; New York, Evanston, and London: 1970.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Stat Soc B. 1977;39:1–38.
- Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004.
- Escamilla MA. Population isolates: their special value for locating genes for bipolar disorder. Bipolar Disord. 2001;3:299–317.
- Ewens WJ, Spielman RS. The transmission/disequilibrium test: history, subdivision and admixture. Am J Hum Genet. 1995;57:455–464.
- Hirschhorn JN, Daly MJ. Genome-wide association studies for common diseases and complex traits. Nat Rev Genet. 2005;6:95–108.
- Jacquard A. Genetic information given by a relative. Biometrics. 1972;28:1101–1114.
- Laird NM, Horvath S, Xu X. Implementing a unified approach to family based tests of association. Genet Epidemiol. 2000;19(Suppl 1):S36–42.
- Lynch M, Ritland K. Estimation of pairwise relatedness with molecular markers. Genetics. 1999;152:1753–1766.
- Marchini J, Cardon LR, Phillips MS, Donnelly P. The effects of human population structure on large genetic association studies. Nat Genet. 2004;36:512–517.
- McPeek MS, Sun L. Statistical tests for detection of misspecified relationships by use of genome-screen data. Am J Hum Genet. 2000;66:1076–1094.
- Milligan BG. Maximum-likelihood estimation of relatedness. Genetics. 2003;163:1153–1167.
- Ott J, Rabinowitz D. The effect of marker heterozygosity on the power to detect linkage disequilibrium. Genetics. 1997;147:927–930.
- Peltonen L. Positional cloning of disease genes: advantages of genetic isolates. Hum Hered. 2000;50:66–75.
- Peltonen L, Palotie A, Lange K. Use of population isolates for mapping complex traits. Nat Rev Genet. 2000;1:182–190.
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909.
- Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999;65:220–228.
- Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–1517.
- Ritland K. Estimators for pairwise relatedness and inbreeding coefficients. Genet Res. 1996;67:175–186.
- Shifman S, Darvasi A. The value of isolated populations. Nat Genet. 2001;28:309–310.
- Shmulewitz D, Zhang J, Greenberg DA. Case-control association studies in mixed populations: correcting using genomic control. Hum Hered. 2004;58:145–153.
- Slager SL, Schaid DJ. Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am J Hum Genet. 2001;68:1457–1462.
- Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52:506–516.
- The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–683.
- Thompson EA. The estimation of pairwise relationships. Ann Hum Genet. 1975;39:173–188.
- Thompson EA. Monte Carlo in Genetic Analysis. Technical report No 294. Department of Statistics, University of Washington; 1995.
- Thompson EA. Statistical inference from genetic data on pedigrees. Vol. 6. IMS/ASA; 2000.
- Thompson EA. Statistical Inferences from Genetic Data on Pedigrees. NSF-CBMS Regional Conference Series in Probability and Statistics. Vol. 6. IMS; Beachwood, OH: 2000.
- Thompson EA. MCMC in the Analysis of Genetic Data on Pedigrees. In: Liang F, Wang J-S, Kendall W, editors. Markov Chain Monte Carlo: Innovations and Applications. Lecture Note Series of the IMS National University of Singapore. World Scientific Co Pte Ltd; Singapore: 2005. pp. 183–216.
- Thornton T, McPeek MS. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am J Hum Genet. 2007;81:321–337.
- Venken T, Del-Favero J. Chasing genes for mood disorders and schizophrenia in genetically isolated populations. Hum Mutat. 2007;28:1156–1170.
- Voight BJ, Pritchard JK. Confounding from cryptic relatedness in case-control association studies. PLoS Genetics. 2005;1(3):e32.
- Wang J. An estimator for pairwise relatedness using molecular markers. Genetics. 2002;160:1203–1215.
- Weir BS, Anderson AD, Hepler AB. Genetic relatedness analysis: modern data and new challenges. Nat Rev Genet. 2006;7:771–780.
- Wijsman EM, Rosenthal EA, Hall D, Blundell ML, Sobin C, Heath SC, Williams R, Brownstein MJ, Gogos JA, Karayiorgou M. Genome-wide scan in a large complex pedigree with predominantly male schizophrenics from the island of Kosrae: evidence for linkage to chromosome 2q. Mol Psychiatry. 2003;8:695–705.
- Wright AF, Carothers AD, Pirastu M. Population choice in mapping genes for complex diseases. Nat Genet. 1999;23:397–404.
- Zang Y, Zhang H, Yang Y, Zheng G. Robust genomic control and robust delta centralization tests for case-control association studies. Hum Hered. 2007;63:187–195.
