- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

# PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses

## Abstract

Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.

In spite of a substantial body of research that spans decades, we have largely failed to elucidate the molecular genetic basis of most common, complex human traits and diseases. The genetic epidemiology of these outcomes has often convincingly demonstrated only two facts: that genetic factors play an important role, and that the genetic variation is not due to a single, Mendelian mutation. With this implication of polygenic effects (many genes of small effect) in mind, researchers have become increasingly aware of the need to design larger linkage and association studies that have adequate power.^{1}^{,}^{2} However, the strategies of the past decade have met with limited success.^{3}^{,}^{4} One possible reason for the lack of identified complex-trait disease genes is that studies have still been lacking in sample size and genome coverage.

Modern whole-genome association studies (WGAS) represent a direct attempt to address these problems. On the basis of advances arising from large-scale genomic projects—including the human genome sequence, SNP discovery efforts, and the HapMap project, as well as new genotyping technology—it has been only in the past 1 or 2 years that our understanding of variation and our technical ability to assess it have enabled association to move from candidate-gene to unbiased whole-genome searches. The standard logic of the WGAS design implicitly assumes that common variants with modest effects on disease frequently exist and explain substantial proportions of variation (i.e., the common disease/common variant [CD/CV] hypothesis)^{5}; this implies that previous studies either have not looked at them at all (i.e., not enough genetic markers tested) or have been underpowered to find significant associations (i.e., not enough samples). Eventually, WGAS should provide a powerful and comprehensive test of the CD/CV hypothesis for any given disease. In this report, we introduce a new analytic tool for WGAS and discuss some crucial analytic-design considerations, such as multiple testing, bias due to confounding, and the possibility that rare genetic variation underlies common disease.

New SNP genotyping technologies have enabled the next generation of genetic studies, with many WGAS either planned, under way, or already completed. A typical WGAS, currently with hundreds of thousands of SNPs genotyped for thousands of individuals, represents a data set that is several orders of magnitude larger than previous linkage and association studies. As such, WGAS present new computational and statistical challenges. Perhaps the most apparent challenge is from the increased multiple-testing burden: the concern that, from a set of hundreds of thousands of tests, many highly significant results are expected by chance alone, making it hard to distinguish signal from noise. To a large extent, this problem can be assuaged by moderate increases in sample size: basic power calculations show that maintaining the same power when performing an exponentially larger number of Bonferroni-corrected tests requires only a linear increase in sample size. For example, if 500 individuals are needed to test a single SNP with adequate power, ~2,000 individuals will be required to test 500,000 SNPs, even after Bonferroni correction. So, although increased sample sizes are certainly required, even with the most conservative statistical approaches, these will often be achievable, rather than order-of-magnitude, increases.

The size of these data sets will present a computational as well as statistical testing burden, since many existing genetic-analysis software programs were not designed with WGAS in mind. We have therefore developed a user-friendly software tool, ^{PLINK}, to facilitate the analysis of whole-genome data in a number of ways: by addressing the mundane but important need for easy ways to manage such data, by making routine analyses computationally efficient, and by offering new analyses that take advantage of whole-genome coverage. When a relatively small WGAS data set of 100,000 SNPs genotyped for 350 individuals is considered, for example, PLINK takes ~10 s to load, filter, and perform association analysis for all SNPs; straightforward handling of much larger data sets is also possible.

Aside from computational challenges, larger data sets also exacerbate the problem of confounding in genetic association studies. With increased power to detect true effects comes increased potential for bias to affect results (i.e., the “power to detect” departure from the null hypothesis due to unaccounted confounders will also increase). One well-acknowledged source of confounding in population-based association studies is population stratification.^{6} However, in the context of WGAS, this perhaps presents less of a problem, given the availability of hundreds of thousands of markers across the genome, which allows for a very accurate empirical assessment of stratification via genomic control^{7} and structured association methods.^{8}^{}^{–}^{10} Augmenting this set of approaches, we describe below our approach to stratification, implemented within PLINK and designed to work with whole-genome data.

Another arguably more insidious source of confounding in WGAS is from nonrandom genotyping failure,^{11}^{,}^{12} which involves an individual’s SNP genotype that is either incorrectly called or (more commonly) not called at all. If this failure is nonrandom with respect to genotype (e.g., some genotypes are more likely to be uncalled) and to phenotype (e.g., cases have lower genotyping rates than do controls, on average), then false-positive associations can occur. That certain genotypes for a given SNP are more likely than others to fail is almost certainly the rule rather than the exception for any genotyping technology. Furthermore, it is probably also the exception rather than the rule that cases and controls are collected at exactly the same time and place and are handled similarly throughout the laboratory process; indeed, control data may come from a completely different study, having been genotyped in a different laboratory and called with a different algorithm.

Even though genotyping rates might be very high in general, in large samples, even a small proportion of nonrandom genotyping failure could induce a false-positive association (especially if it occurs for one of the many SNPs already showing a high level of association by chance). Because normal screening procedures based on measures such as overall genotyping rate and Hardy-Weinberg equilibrium will often not detect these biased SNPs, it is important to look closely at patterns of genotyping failure for nonrandom effects (as well as to visually inspect the raw data before calling genotypes). In PLINK, genotyping failure can be examined with respect to both phenotype and (potentially unobserved) genotype.

For some complex traits and diseases, an alternate hypothesis for the lack of identified genes is that common variants do not explain a substantial proportion of the phenotypic variation. Under this model, the considerable levels of heritability could reflect aggregates of very many, very rare variants (each potentially of moderate effect but accounting for virtually none of the variation at the population level), which we refer to as the “multiple rare variant” (MRV) hypothesis.^{13} Standard association approaches will likely fail when the MRV hypothesis holds (power will be low even before multiple-testing corrections, even for high genotypic relative risks, e.g., >5, if the frequency is very low, e.g., ~1/10,000). Importantly, though, the same data collected for WGAS (in particular, panels of common SNPs genotyped in population-based samples) can potentially be analyzed using different approaches that do not assume that common variation underlies disease. In particular, if multiple rare disease variants exist *within the same gene or genomic region,* then, instead of standard association, one might consider an approach more akin to linkage analysis but performed in population-based samples of unrelated individuals. Rather than directly test frequency differences of a variant, we propose examining ancestral sharing at a locus, following ideas from previous work on haplotype sharing methods.^{14}^{,}^{15} That is, given ascertainment based on disease, we might expect to see multiple copies of even very rare variants that are moderately or highly penetrant among the descendants of the founder in whom the mutational events occurred.

In standard association analysis, undocumented relatedness can be another source of bias, although, with whole-genome data and analytic tools such as those described below, one can unambiguously detect closely related individuals. However, more-distant relatedness between individuals who share the same disease may convey additional information for gene mapping. Analyses of the type we propose here might be able to leverage this information to provide a complementary approach to standard association analysis, with use of the same data already being collected for single-SNP association studies.

If two individuals share the same rare variant, we would also expect that they share not just that variant but also the surrounding chromosomal region, particularly because rarer variants are more likely to be relatively recent. We propose the use of panels of common SNPs to look for these regions of extended sharing (regions that are inherited identical by descent [IBD] between seemingly unrelated individuals). If a particular region harbors multiple rare variants, we would expect to see inflated levels of segmental sharing between case/case pairs at that locus, compared with case/control and control/control pairs. A procedure for detecting shared segments and testing for correlation between sharing and phenotypic similarity forms the basis of a population-based linkage analysis, which is intended as an approach complementary to standard association analysis. This approach differs from standard haplotype analysis, in that we do not try to infer phase explicitly or estimate haplotype frequencies, so we can accurately assess sharing of very rare, very long regions; in addition, the subsequent unit of analysis is sharing at the locus rather than the frequency of one or more haplotypic variants.

In summary, given the issues raised above, we designed the ^{PLINK} WGAS tool set to meet the following requirements: (*a*) to provide a simple way to handle large WGAS data sets, (*b*) to assess confounding due to stratification and nonrandom genotyping failure and to produce a range of other summary statistics, (*c*) to perform a variety of standard association tests efficiently on very large data sets (in populations or families, for disease or quantitative outcomes, allowing for covariates, haplotypic tests, etc.), and (*d*) to provide a means of assaying rare variation with the use of common SNP panels, thereby providing a mapping method that might perform better when the MRV model holds. In the rest of this report, we highlight some of ^{PLINK}’s main features, briefly describing five domains of functions: data management, summary statistics, assessment of population stratification, association analysis, and IBD estimation. All these methods are applicable to whole-genome data sets. Below, we either describe or provide references for the tests implemented; these methods and other new ones being added are described in more detail in the online technical documentation being added to the PLINK Web site.

*Data management*.—We have developed a compact binary file format to represent SNP data, as well as tools to transform the binary format to standard text-based formats (including both a one-row-per-individual and a transposed one-row-per-SNP format). A simple interface is provided for reordering, recoding, and filtering genotype information (i.e., extracting individuals and/or SNPs on the basis of certain criteria, such as physical position, genotyping rate, or covariate values). It is also possible to merge two or more data sets that can partially overlap, in terms of both individuals and markers, and produce reports of discrepancies between overlapping data sets.

*Summary statistics*.—Standard summary measures are available: genotyping rates, allele and genotype frequencies, Hardy-Weinberg equilibrium tests using asymptotic and exact^{16} procedures, and single-SNP Mendelian error summaries for family data. PLINK also estimates individual heterozygosity rates and provides an automatic sex-check facility based on X-chromosome heterozygosity. We employ a per-SNP test of nonrandom genotyping failure with respect to phenotypic status, which is based on a simple χ^{2} test of different rates of genotyping failure in cases versus controls.

We also test whether missingness at a site can be predicted by the local haplotypic background, to spot nonrandom genotyping failure with respect to genotype. Taking each SNP that has an above-threshold level of genotyping failure as the reference SNP, we ask whether the haplotypes formed by the two (or more) flanking SNPs can predict which individuals are missing at the reference SNP. The test is a simple haplotypic case/control test, where the phenotype is the presence or absence of a called genotype at the reference SNP. If missingness at the reference SNP is not random with respect to the true (unobserved) genotype, we will often expect to see an association between missingness and flanking haplotypes. This test will often have higher specificity than sensitivity: it relies on linkage disequilibrium (LD) patterns to make an inference about the potentially unobserved reference allele, so it might miss many SNPs showing high, nonrandom levels of genotyping failure. However, used as a screening tool, SNPs that show highly nonrandom patterns of missing data could obviously be problematic and should be treated with caution.

For an example, we consider an Illumina whole-genome SNP data set, available free of charge from the National Institute of Neurological Disorders and Stroke (NINDS) Repository at ^{Coriell} (see the Acknowledgments for full details), comprising 276 amyotrophic lateral sclerosis cases and 271 controls. We illustrate the above tests for one particular SNP, *rs5742981.* Genotyping failure for this SNP was not randomly distributed with respect to phenotype (10.5% in cases and 0.7% in controls, *P*=2×10^{-7}) or genotype. Flanking haplotypes (formed by SNPs *rs1899025* and *rs5743030*) are very strongly associated with genotyping status at *rs5742981* (e.g., the GA haplotype is associated with the missing *rs5742981* genotype at *P*=9×10^{-56}). In fact, if we divide the sample into individuals who are heterozygous for the background haplotype (*N*=213) and those who are not (*N*=333), then all instances of genotyping failure at *rs5742981* fall into the smaller, heterozygous group (*P*=7.6×10^{-13}). It would seem that, in this particular case, heterozygosity for the haplotypic background predicts heterozygosity at the reference SNP and that heterozygotes are preferentially dropped in cases only. This would also seem to generate the association between allele count at *rs5742981* with disease (*P*=.0043; minor-allele frequency [MAF] 0.4% in cases and 2.6% in controls). Of course, such problems can often, but not always, be avoided by imposing appropriate limits on allele frequency, genotyping rate, and Hardy-Weinberg threshold. In this case, most Hardy-Weinberg filters would not have excluded this SNP (in controls, *P*=1; in the total sample, *P*=.11), although a missing-data threshold of 5% would have excluded this SNP. In any case, having additional simple quality-control (QC) metrics, including the two presented here, being automatically and quickly calculated by PLINK will often help to flag problematic SNPs.

*Population stratification*.—On the basis of the genomewide average proportion of alleles shared identical by state (IBS) between any two individuals, PLINK offers tools to (*a*) cluster individuals into homogeneous subsets, (*b*) perform classical multidimensional scaling (MDS) to visualize substructure and provide quantitative indices of population genetic variation, and (*c*) identify outlying individuals. PLINK uses complete-linkage hierarchical clustering to assess population stratification, with the use of whole-genome SNP data. This agglomerative procedure starts by considering every individual as a separate cluster of size 1, then repeatedly merges the two closest clusters. Complete-linkage clustering specifies that clusters are compared on the basis of their two most dissimilar members; clustering stops either when all individuals belong to one cluster or on the basis of prespecified constraints (stopping rules).

Various optional constraints can be applied with the specific goal of subsequent association analysis in mind, rather than an accurate description of population genetic variation per se. That is, we aim to ensure that all members of any derived cluster belong to the same subpopulation, rather than attempting to ensure that all members of the same subpopulation belong to the same cluster. The purpose of the constraints is to select which solution to accept from the distance-based clustering approach—that is, with no constraints, all solutions are considered (i.e., for *N* individuals, from *N* clusters, each of size 1, to 1 cluster of size *N*).

One constraint PLINK applies is called the “pairwise population concordance” (PPC) test, similar to a method used by Lee,^{17} such that for any putative new cluster, all pairs of individuals pass this test. For a given pair, we expect to see autosomal SNPs with two copies of each allele occur in a 2:1 ratio of IBS 2 {Aa,Aa} to IBS 0 {AA,aa} SNP pairs if both members of the pair come from the same random-mating population. For SNPs selected far enough apart to be approximately independent (e.g., 500 kb), a test of binomial proportions can suggest concordant or discordant ancestry for each pair of individuals. A pair from different populations is expected to show relatively more IBS 0 SNPs; a one-sided test for departure from a 2:1 ratio is given by the normal approximation to the binomial: for a particular pair, if *L* is the total number of informative, independent SNP pairs and *L*_{2} is the IBS 2 subset,

One can choose to merge clusters only if no between-cluster pairs have a statistically significant PPC result at a given significance threshold. In addition to the PPC test, we have incorporated other constraints in the clustering procedure. As mentioned above, nonrandom genotyping failure is a possible source of confounding in genetic association studies. One possible constraint is to cluster only individuals who have similar profiles of missing data, or “identity by missingness,” in which we specify a threshold for the maximum permissible proportion of sites for which two individuals are discordant in genotyping status (genotyped versus missing). For case/control samples, another possible constraint is that each cluster of two or more individuals has at least one case and one control (and so is informative for association analysis that conditions on cluster). Alternatively, the maximum cluster size or the number of clusters can be fixed. It is also possible to combine phenotype and cluster size constraints, by specifying that a cluster contains no more than one case and three controls, for example. Finally, one can also combine multiple external categorical and quantitative matching criteria (such as age, sex, other environmental variables, or QC measures such as the genotype call rate for each individual) alongside the genetic matching. Categorical criteria can be either “positive” or “negative,” such that only similarly categorized or differently categorized individuals can be merged. It is also possible to select only a single individual from a particular prespecified group. The complete algorithm is as follows: the IBS distance between individual *k* (belonging to cluster *i*) and individual *l* (belonging to cluster *j*) is denoted *d*_{ijkl}; the between-cluster distances are denoted *D*_{ij}.

- 1. START: Find valid
*i*,*j*for , where . - 2. Test (optional) constraints for this potential new cluster:
- → New cluster contains both cases and controls?
- → Merged
*i*+*j*cluster smaller than maximum cluster size constraint? - → Maximum number of cases or controls exceeded?

- 3. For every pair between
*i*and*j,*test the following (optional) constraints:- → Pairable based on external constraints?
- → Nonsignificant PPC test?
- → Pass identity-by-missingness threshold?
- → Already selected an individual from this group?

- 4. Satisfies constraints? → Merge clusters.
- 5. No remaining pairable clusters? → STOP.
- 6. Return to START for next best pair of clusters.

PLINK also provides an alternate way to look at population stratification: rather than clustering into discrete groups, it can use the data-reduction technique of classical MDS to produce a *k*-dimensional representation of any substructure. Although the primary use of this approach is for visualization, the values for each of the *k* dimensions, instead of discrete clusters, can be used as covariates in subsequent association analysis to control for stratification. There is an option to use a Euclidean IBS distance metric in place of the standard metric of proportional sharing; classical MDS based on a Euclidean distance metric is numerically identical to principal-components analysis, which forms the basis of other methods.^{10}

Finally, PLINK also supports an IBS-based “nearest-neighbor” analysis to detect outlying individuals who do not belong with any major cluster in the sample. For each individual, the distance to its nearest neighbor is calculated; this distribution is standardized (by the sample mean and variance of nearest-neighbor distances) and can be inspected for outliers. The same procedure can also be applied to individuals’ *n*th-nearest neighbor.

Here, we illustrate how these methods can differentiate between two quite similar populations and can control for between-population differences in tests of association, with the use of available ^{HapMap} data. From the 90 Asian individuals (45 unrelated Han Chinese from Beijing labeled “CHB,” and 45 unrelated Japanese from Tokyo labeled “JPT”) in the phase II data set, we extracted the set of autosomal SNPs on the Affymetrix GeneChip 500K Mapping Array. Figure Figure11 shows the results of an MDS analysis, which clearly separates two clusters (the left and right clusters correspond to CHB and JPT, respectively). The color coding in the three panels shows the classification of individuals according to increasingly liberal PPC thresholds: 0.01, 0.001, and 0.0001 (from left to right), which result in 7, 5, and 4 classes, respectively. In all solutions, two of the classes represent two single Japanese individuals: the nearest-neighbor diagnostics identified these two individuals (NA18987 and NA18992) as clear outliers >3 SDs below the mean.

*left*) and JPT (

*right*) HapMap populations. The figure’s three panels differ only in the color scheme, which represents classification

**...**

To mimic a data set showing stratification effects, we next created a dummy phenotype to represent genotype at *rs2976396,* a SNP that shows strong (*P*=2.7×10^{-8}) allele-frequency differences between CHB (frequency of the A allele is 0.24) and JPT (frequency of the A allele is 0.66) populations. Specifically, individuals with the GG genotype are designated cases, and AA and AG individuals are designated controls. The dummy phenotype therefore shows a marked difference in prevalence between CHB and JPT populations. When a standard association analysis is performed (methods described below) for all 500,000 SNPs with this phenotype, the genomic-control inflation factor is 1.163, which is indicative of some inflation (importantly, this would be sufficient to impact the tail of the test statistic distribution, perhaps pushing what would have been marginally significant results to genomewide-significant results). Performing tests of association, with between-strata effects controlled for, reduced the genomic-control inflation factor to virtually 1 for all three solutions (1.00474, 1.00315, and 1.00098 for PPC 0.01, 0.001, and 0.0001 solutions, respectively; when actual population membership, CHB versus JPT, was used instead of the empirically derived classification, the inflation factor was 1.00044).

Given the range of methods and options for detecting and correcting for population stratification (in addition to a genomic-control procedure^{7} implemented in PLINK), further work is clearly required to determine and quantify the typical performance of these approaches, in terms of the type I and type II error rates in subsequent association analysis. Further work is also needed to assess how to best apply these types of genetic-matching procedures when controls are selected from large, preexisting panels (i.e., as opposed to cases and controls being collected and genotyped together). Toward this goal, we are involved in work that provides direct comparisons of PLINK with other methods for real whole-genome data^{18} (R. Plenge, C. Cotsapas, L. Davies, A. L. Price, P. I. W. de Bakker, J. Maller, I. Pe’er, N. Burtt, B. Blumenstiel, M. DeFelice, et al., unpublished data).

*Association analysis*.—As well as the standard case/control allelic test, PLINK offers a Cochran-Armitage trend test, Fisher’s exact test, genotypic tests (general, dominant, and recessive models), and Cochran-Mantel-Haenszel tests for stratified tables,^{19} which allow for tests of association conditional on any cluster solution or other categorization of samples. The Breslow-Day^{19} and homogeneity of odds ratio^{20} tests are supported, as are tests for quantitative traits that use a standard linear regression of phenotype on allele dosage. The standard disease- and quantitative-trait association tests are implemented for speed; the same analyses are also framed as more general linear- and logistic-regression models that allow for multiple binary or continuous covariates having both main effects and interactions. One can test for joint effects or perform a scan conditional on a given SNP or set of SNPs, for example; also, gene-gene and gene-environment interaction tests for quantitative and disease traits can be performed.

For family data, the standard transmission/disequilibrium test (TDT)^{21} is provided. The permutation procedure applied to the TDT flips the transmitted and untransmitted alleles of all individuals in a nuclear family for all SNPs per permutation, thereby preserving in each permuted data set the possible nonindependence of transmissions across SNPs and across multiple offspring due to LD and linkage. We also implement the sib-TDT^{22} for nuclear families, to include sibships without parents as well as unrelated individuals (called the “DFAM test” within PLINK). We break pedigrees into nuclear families and classify them as those in which both parents are genotyped and those in which they are not. For the first class of families, we obtain the allele count of the minor allele (*A*; major allele is *a*) among affected children in family *f,* labeled *S*_{f}. Under the null hypothesis of no association, the binomial distribution gives the expected value and variance of *S*_{f}, given the parental genotypes (i.e., sampling parental transmissions with replacement). If family *f* contains *D* affected offspring, then *E*(*S*_{f})=*DP*, where *P* is 1/2, 1, or 1 1/2 for aa×Aa, Aa×Aa, and AA×Aa parental mating types, respectively, and *Var*(*S*_{f})=*D*/4 unless both parents are heterozygous, in which case *Var*(*S*_{f})=*D*/2. For the second class of families, we also obtain *S*_{f}, the count of minor alleles in affected offspring and its expected value and variance (under *H*_{0}, based on the genotypes of all siblings in the family); these are given by the multivariate hypergeometric distribution (i.e., sampling genotypes without replacement). The use of the genotype-based multivariate hypergeometric distribution in sibships accounts for the fact that not all allelic combinations are possible within a sibship (e.g., an individual cannot have two paternal alleles). The number of all offspring with genotypes AA, Aa, and aa in family *f* are labeled *N*_{AA}, *N*_{Aa}, and *N*_{aa}, respectively (which sum to *N*); the equivalent numbers in affected offspring are *D*_{AA}, *D*_{Aa}, and *D*_{aa}, which sum to *D*; therefore, *S*_{f}=2*D*_{AA}+*D*_{Aa}. The expected allele count in affected individuals in family *f* under the null hypothesis is *E*(*S*_{f})=2*E*(*D*_{AA})+*E*(*D*_{Aa})=(2*N*_{AA}+*N*_{Aa})(*D*/*N*), and the variance, obtained using the multivariate hypergeometric distribution, is

Summing over families, a test statistic is

which follows a χ^{2}_{1} distribution under the null hypothesis. Additional unrelated cases and controls, potentially stratified into clusters, can be included within this framework if they are treated as sibships and the standard hypergeometric distribution is used (i.e., sampling *alleles* without replacement), which is equivalent to the standard Cochran-Mantel-Haenszel test mentioned above.

For quantitative traits, PLINK provides an implementation of the between/within model,^{23}^{,}^{24} which uses a permutation procedure (permuting genotype rather than phenotype) to control for the nonindependence of individuals within the same family (the QFAM test). The analysis of phenotype-genotype association is a standard regression of phenotype on genotype that ignores family structure. Significance is based on the following permutation procedure: genotypes are decomposed into between- and within-family components, following the models referenced above; these two components are then permuted independently at the level of the family and are summed to form new pseudogenotype scores for each individual. That is, between components are swapped between families; within components have their sign swapped, with a 50% chance (similar for all members of the same family). This approach provides tests that give correct type I error rates accounting for the relatedness between individuals. Despite the necessity of permutation, one advantage is that nonnormal and dichotomous phenotypes can be appropriately analyzed. Whereas the basic test is of total association, the between and within components can also be tested separately. Information about parental phenotypes can also be combined in these analyses.^{25}^{,}^{26}

There is support for haplotype-based case/control and quantitative trait tests and TDTs based on the expected haplotype distribution for each individual obtained from expectation-maximization phasing. Either prespecified lists or sliding windows are used to specify the particular haplotype tests; precomputed lists of efficient sets of tests for common WGAS products based on ^{HapMap}^{,27} are available from the PLINK Web site and can be immediately applied to these data sets. Also, two nonhaplotypic multilocus “gene-based” or “set-based” tests are available: sum-statistics^{28} and, for case/control samples, Hotelling’s *T*^{2}.

For many tests, a number of permutation procedures are available: “adaptive” permutations, which give up early on clearly nonsignificant results^{29}; a “max(T)” permutation to correct for multiple tests^{30}; a rank-ordered permutation in which the *n*th-best original result is compared against the *n*th best in each permuted data set; gene-dropping for family-based tests; and, finally, the between/within permutation scheme described above. A range of multiple-test corrections are also available, including those based on Bonferroni correction and false-discovery rate.^{31}

*IBD estimation.*—The final domain of function concerns IBD estimation. In homogeneous samples, PLINK provides options to estimate genomewide IBD-sharing coefficients between seemingly unrelated individuals from whole-genome data.^{32} These metrics (probabilities of sharing 0, 1, or 2 alleles IBD) can be particularly useful for QC, by diagnosing pedigree errors, undetected relationships, and sample swap, duplication, and contamination events.

PLINK has a simple procedure to find extended stretches of homozygosity in whole-genome data (regions spanning more than a certain number of SNPs and/or kilobases, allowing for a certain amount of missing genotypes and/or occasional heterozygote calls) that occur relatively frequently, and it can provide a powerful approach to map recessive disease genes.^{33}^{,}^{34} Via permutation, an empirical *P* value can be calculated for each SNP on the basis of a test for whether there is a higher rate of homozygous segments spanning that position in cases versus controls. PLINK also has options to determine distinct sets of overlapping (and, optionally, allelically matching) segments, thereby allowing for further inspection of the data.

PLINK also calculates inbreeding coefficients for each individual. Specifically, for a particular SNP with known allele frequencies *p* and *q,* the probability that individual *i* is homozygous equals *f*_{i}+(1-*f*_{i})(*p*^{2}+*q*^{2}), or the probability of being autozygous (homozygous by descent) (*f*_{i}) plus the probability of being homozygous by chance. If individual *i* has *L*_{i} genotyped autosomal SNPs, let *O*_{i} be the number of observed homozygotes and *E*_{i} be the number expected by chance; then, *O*_{i}=*f*_{i}×*L*_{i}+(1-*f*_{i})*E*_{i}, which gives *f*_{i}=(*O*_{i}-*E*_{i})/(*L*_{i}-*E*_{i}). When allele frequencies are not known but are estimated from the sample, an unbiased estimator of *E*_{i} is based on the sum over all SNPs not missing for individual *i*: , where *T*_{Aj} is twice the number of nonmissing genotypes for SNP *j.*

We have also implemented a novel method to detect extended chromosomal segmental IBD sharing between pairs of distantly related individuals by use of a hidden Markov model (HMM), in which the underlying hidden IBD state is estimated given the observed IBS sharing and genomewide level of relatedness between the pair. We also provide a test for correlation between segmental chromosomal sharing and phenotypic sharing. This test, a population-based linkage analysis, potentially offers a complementary approach to whole-genome data that does not assume the common variant hypothesis of disease-related genetic variation. We describe our approach in three steps: estimation of genomewide relatedness, estimation of local segmental sharing, and relating pairwise segmental sharing to phenotypic similarity.

We use a method-of-moments approach to estimate the probability of sharing 0, 1, or 2 alleles IBD for any two individuals from the same homogeneous, random-mating population. If we denote IBS states as *I* and IBD states as *Z* (in both cases, the possible states being 0, 1, and 2), then we can express the prior probability of IBS sharing as

As described in detail below, for each SNP, we specify *P*(*I*|*Z*) in terms of the allele frequency; averaging over all SNPs, we obtain the expected value for *P*(*I*|*Z*). Then, rearranging the three equations implied by equation (1), we solve for *P*(*Z*=0), *P*(*Z*=1), and *P*(*Z*=2) and calculate

the proportion of alleles shared IBD.

For all SNPs, we calculate allele frequencies (on the basis of only founders if family information is present). For any one marker, *P*(*I*|*Z*) is a function of allele frequency (for alleles A and a, these are *p* and *q*=1-*p*, respectively). If *p* and *q* were known with certainty, then 2*p*^{2}*q*^{2} would, for example, be an unbiased estimator of *P*(*I*=0|*Z*=0) (i.e., this requires that both individuals have opposite homozygotes, either {AA/aa} with probability *p*^{2}×*q*^{2} or {aa/AA} with probability *q*^{2}×*p*^{2}). However, because *p* and *q* are estimated only from a finite sample, there is a bias that we take into account as follows. Let *X* and *Y* equal the counts of the two alleles in the sample for a particular SNP, so that *p*=*X*/*T*_{A} and *q*=*Y*/*T*_{A}, where *T*_{A} is twice the number of nonmissing genotypes. There are *T*_{A}(*T*_{A}-1)(*T*_{A}-2)(*T*_{A}-3) possible ways of selecting four distinct alleles from *T*_{A} alleles; of these, *X*(*X*-1)*Y*(*Y*-1) will be {AA/aa} genotype pairs and *Y*(*Y*-1)*X*(*X*-1) will be {aa/AA}. Therefore,

which, reexpressed in terms of the original probabilities and a correction factor based on allele counts, equals

Following a similar logic, the full set of *P*(*I*|*Z*) is given in table 1.

Conditional on IBD state *Z*=*z* for the entire genome, the expected count of SNPs with IBS state *I*=*i* is given as , where the summation is over all SNPs with genotype data on both individuals. Then, from equation (1), we can obtain global IBD estimates of *P*(*Z*) for that pair by the method of moments, substituting into

and

These estimates of *P*(*Z*) are not bounded 0*x*1 and are also not constrained to biologically plausible values (e.g., 0.5, 0, and 0.5 are not plausible values for IBD states 0, 1, and 2, respectively). In practice, we bound these estimates as follows. If *P*(*Z*=0)>1, then *P*(*Z*=0) is set to 1 and *P*(*Z*=1) and *P*(*Z*=2) are set to 0. If *P*(*Z*=0)<0, then *P*(*Z*=0) is set to 0 and *P*(*Z*=1) and *P*(*Z*=2) are set to *P*(*Z*=1)/*S* and *P*(*Z*=2)/*S*, respectively, where *S*=*P*(*Z*=1)+*P*(*Z*=2).

If π^{2}*P*(*Z*=2), where

to constrain IBD estimates to biologically plausible values (assuming a homogeneous, random-mating population), we find a new value for *P*(*Z*=1), which we label *P*_{*}(*Z*=1), which satisfies the equation

which gives *P*_{*}(*Z*=1)=2π(1-π). The transformed IBD probabilities, used in all subsequent calculations, are therefore

and

When not constrained to biologically plausible values, genomewide IBD sharing estimates can be used for QC and to indicate and diagnose sample and genotyping errors, including swaps, duplications, and contamination events, as well as misspecified or undetected familial relationships. For example, values of *P*(*Z*=2) near 1 clearly indicate duplicated samples (or MZ twins). Alternatively, if an experiment is conducted on two separate chips (e.g., two 250K SNP arrays comprising a 500K array), values near 0.5, 0, and 0.5 might represent an individual duplicated for one 250K array only.

If DNA from one or more individual contaminates other samples, this can lead to a distinctive pattern of contaminated samples showing high IBD with all other individuals. This is because contamination induces false heterozygote calls (e.g., AA pooled with CC may well be typed as AC), and heterozygotes cannot be IBS 0 with any other SNP genotype, which artificially inflates IBD estimates. Furthermore, contaminated samples will show strong, negative inbreeding coefficients, indicating more heterozygotes than expected.

In analogy to the traditional Lander-Green algorithm for multilocus analysis,^{35} we use an HMM approach to provide multipoint estimates of allele-sharing IBD for each pair of individuals in a homogeneous sample at any arbitrary position along the chromosome, given the observed pattern of IBS sharing. Note that, unlike full inheritance vectors, IBD states along the chromosome do not actually satisfy the Markov property. Nevertheless, we used an HMM because it is computationally tractable and likely to give a good approximation.

We require the conditional probability of IBD for *z*=0, 1, or 2 at a particular position, given the marker genotypes *M* of all *K* markers on a chromosome, *P*(*Z*=*z*|*M*). This can be reexpressed, using the Bayes theorem, as

Here, *P*(*Z*=*z*) is the global IBD sharing probability for the whole genome, and the summation is over the three possible IBD states. Because of the Markov property, the probability *P*(*M*|*Z*=*z*) can be factorized as the product *P*(*M*_{L}|*Z*=*z*)×*P*(*M*_{R}|*Z*=*z*), where *M*_{L} and *M*_{R} are the marker genotypes to the left and to the right, respectively, of the position. Suppose the position is between markers *l* and *l*+1; then, the Markov property ensures that

where the summation is over all possible IBD states for all markers. Writing the 3×3 diagonal matrix of marker genotype probabilities conditional on IBD state for marker *l* as *M*_{l} and the 3×3 transition matrix between marker *l* and *l*+1 as *T*_{l} (where element *t*_{ij} is the conditional probability of marker *l* having IBD state *j,* given that marker *l*+1 has IBD state *i*), this summation can be written in matrix form as

where *T*_{L} is the transition matrix between marker *l* and the position, **1** is a 3×1 vector of 1s, and *z* is a 3×1 column vector that has value 1 for element *z* and value 0 for the others. The elements of *M* and *T* are given in tables tables22 and and3,3, respectively, and are described in the section below.

**T**Specifying

*P*(

*Z*

_{l+1}|

*Z*

_{l}) for Diploid IBD State from Locus

*l*to Locus

*l*+1

^{[Note]}

The expression *P*(*M*_{L}|*Z*) represents the “left conditional” probability based on markers 1 through *l*; the same procedure is used to calculate the chain of right-conditional probabilities for markers *K* back through *l*+1:

where *T*_{L+1} is the transition matrix between marker *l*+1 and the position.

For computational efficiency, the left conditionals (*M*_{1}*T*^{′}_{1}*M*_{2}*T*^{′}_{2}…*M*_{l}) and right conditionals (*M*_{K}*T*^{′}_{K-1}*M*_{K-1}*T*^{′}_{K-2}…*T*^{′}_{l+1}*M*_{l+1}) are precomputed for different values of *l* and are used in the computation for conditional IBD probabilities of all positions. In practice, normalization steps must be performed at intervals to avoid underflow.

On the basis of these estimates of *P*(*Z*=*z*|*M*), shared segments are defined as any contiguous region having a >50% chance of having at least one pair of chromosomes shared IBD (we can ignore the negligible probability of distantly related individuals sharing two segments IBD). Please see appendix A for a detailed description of the HMM emission and transition parameters.

One requirement of this approach is that SNPs are in approximate linkage equilibrium in the population; otherwise, many small regions of high LD will be called as shared IBD segments. One approach is to prune the SNP panel to a reduced subset of approximately independent SNPs. As outlined in the next section, for the purpose of population-based linkage analysis, we expect most segments surrounding shared rare, recent variants to be relatively large, and therefore detectable, with a less dense SNP panel. We use a repeated sliding-window procedure, recursively pruning SNPs on the basis of pairwise SNP *r*^{2} values and/or the variance inflation factor, which is defined as 1/(1-*R*^{2}), where *R*^{2} is the multiple correlation coefficient between a SNP and all other SNPs in the window based on allele counts. In the context of the population-based linkage test described below, failure to completely prune all sample-level LD should not be particularly troublesome. At worst, it will simply mean that a number of more common extended segments (that are perhaps better tested in a standard association design) will be included in the analysis, possibly reducing efficiency.

The results of applying this method to phase II ^{HapMap} data, in terms of the typical distribution and extent of extended segmental sharing and its relationship to rare variation, are described in the International HapMap Consortium phase II analysis manuscript (International HapMap Consortium, unpublished data). Here, we apply the method to CEU (Utah residents with European ancestry from the CEPH collection) HapMap individuals. For illustrative purposes, we focus here only on a particular region of chromosome 9 shared between two families (parent-offspring trios). Genomewide data about the full CEU panel were still used to calculate allele frequencies (founders only) and global IBD-sharing estimates. We selected SNPs with complete genotyping and MAF >1% and then iteratively removed SNPs showing local LD; the final chromosome 9 data set comprised 6,513 SNPs (~1 per 20 kb). In the CEU sample as a whole, there is virtually no LD in the pruned data set: for the entire chromosome 9, only six LD blocks are identified, comprising 15 SNPs in total (three blocks containing 2 SNPs and three blocks containing 3 SNPs).

We present here a single segment shared between two CEU offspring, NA10863 and NA06991. This segment spans ~3.7 Mb and, in the pruned data set, 272 SNPs. The phase II ^{HapMap} has >4,500 SNPs in this region (CEU panel) and dozens of estimated recombination hotspots. We selected a segment shared between two offspring for illustrative purposes: given that the region is shared IBD, we would also expect to see a pattern of sharing consistent with transmission from one and only one parent in each family, as illustrated in figure 2. That is, we also observe the same segment shared between each offspring and one parent of the other family and also between these two parents. (Naturally, parent-offspring pairs within the same family are always IBD 1; these basic intrafamilial relationships are not shown in fig. 2).

*P*(

*Z*=1), for a 25-Mb region of chromosome 9, for the pairs of individuals between

**...**

No other pairs of individuals show any extended segmental sharing in this same region. This type of rare, extended segment is an example of shared genetic variation that is outside the standard heuristic and analytic framework of LD involving only short, common “haplotype blocks” separated by recombination hotspots. Naturally, this approach can also be applied to detection of the much longer segments shared between very closely related individuals. There are, in fact, a number of close relationships between HapMap founders: in the CEU and YRI (Yoruba individuals from Ibadan, Nigeria) panels, there are a number of cousins and individuals with closer relationships (excluding known parent-offspring relationships, of course). For example, two CEU individuals, NA12154 and NA12264, have a global *P*(*Z*=1) of .14; shared segment analysis reveals at least 33 segments >1 Mb for this single pair. Six segments are >10 Mb, and the longest is 128 Mb (they share virtually all of one copy of chromosome 11). For this pair, the total proportion of the autosomes spanned by these segments (in terms of physical distance) is ~0.12, which is close to the global probability of sharing one copy IBD of .14.

In a population-based sample of cases and controls, having determined the extent and location of pairwise segmental sharing with use of the approach described above, one might also want to inquire whether patterns of segmental sharing are related to phenotypic similarity between individuals. In this section, we describe a test based on the premises that (*a*) shared rare variants will typically reside on shared extended segments and (*b*) there will be an inflation in the rate of segmental sharing at the disease locus in case/case pairs if rare variants influence disease risk. This represents a first-generation approach that can no doubt be extended and improved in numerous ways—for example, by considering other statistics based on segmental sharing.^{37}^{,}^{38}

A sample of *N*_{A} cases and *N*_{U} controls contains *N*_{AA}=*N*_{A}(*N*_{A}-1)/2 case/case pairs and *N*_{!AA}=*N*_{A}*N*_{U}+*N*_{U}(*N*_{U}-1)/2 case-control and control/control pairs. At a particular position *p,* the number of segments shared at a particular locus (i.e., spanning that position) is denoted *S*_{p} for all case/case pairs and *T*_{p} for case/control and control/control pairs. A standard test for a difference in rate of sharing between these two groups is complicated by the fact that not all pairs are independent (since the same individuals will possibly feature in multiple pairs); also, not all pairs have similar degrees of global relatedness. To account for the dependence, we use permutation to generate empirical significance values by label-swapping individuals’ phenotypes and recalculating the pairwise phenotypic concordance metrics (i.e., rather than permuting the pairwise concordance terms directly). The test statistic is framed as a one-sided test (i.e., greater sharing in case/case pairs) and adjusts for the average level of global sharing in the two categories; for position *p* of *L* positions,

which is bounded at 0 (in addition,

and

are also bounded at 0). The permutation procedure is computationally feasible, since the IBD segments do not need to be recomputed for each replicate.

Given regions that show statistically significant levels of increased sharing among cases, one can use PLINK to determine the allelic identity of the specific sets of overlapping segments present (although without inferring phase when both individuals are heterozygous). The results of such an analysis should be similar in principle to those from a linkage analysis, except that signals will be localized on a finer scale.

*Performance*.—Very large WGAS data sets can be analyzed using fairly standard hardware, and there are no fixed limits on the number of samples or SNPs. For example, a Linux workstation with 2 GB random-access memory and a 3.6-GHz dual processor can handle >5,000 individuals genotyped for 500,000 SNPs. On the same machine, one can load (using a binary format file set) the entire phase II ^{HapMap} (3.6 million SNPs for 270 individuals), filter on genotyping rate, and then calculate and output allele frequencies for all SNPs in <6 min.

Although analyses that involve pairwise comparisons between all individuals can take a long time in large samples (in particular, calculating genomewide IBD and IBS for all individuals), if a cluster-computing environment is available, such jobs are easily parallelized, and PLINK provides some options to facilitate this, potentially reducing analyses that might take days to 1 or 2 h with little extra work on the part of the user.

*Graphical user interface*.—We have also developed a separate, optional Java-based graphical user interface, ^{gPLINK}, to initiate, track, and record PLINK jobs. In addition, gPLINK provides integration with ^{Haploview}^{,39}; version 4 of ^{Haploview} offers extensive tools for tabulating, filtering, sorting, merging, and visualizing PLINK WGAS output files in the context of ^{HapMap} LD and genomic information. gPLINK can also extract filtered subsets of WGAS data for viewing in Haploview with just a few mouse clicks. gPLINK either can be used to direct local analyses (with data and computation residing on the same local machine) or can remotely use secure shell networking (with data and computation performed by the remote server, but with initiation and viewing of results done locally). Figure 3 illustrates the relationship among PLINK, gPLINK, and ^{Haploview}.

**...**

In summary, PLINK offers a powerful, user-friendly tool for performing many common analyses with whole-genome data. There is comprehensive Web-based documentation, including a tutorial, an e-mail list, and a Web-based version-check to inform users of updates and problems. As methods for WGAS evolve, we expect that PLINK will be updated. For example, as of this report going to press, newly added features include support for R plug-ins to extend the basic functionality of PLINK, a Web-based SNP annotation look-up tool, and a set of “proxy association” methods, designed to explore single SNP associations in their local haplotypic context. The proxy association tools provide, among other things, a haplotype-based single-SNP test that can often be more robust to nonrandom missing genotype data. Future directions include enhanced tools for browsing annotated WGAS results in their full genomic context and the incorporation of copy-number–variation data.

We have also implemented an approach based on the MRV hypothesis that is designed to be a complement, rather than an alternative, to association analysis. In this report, we have outlined our analytic approach and have described an implementation of the method that is appropriate for whole-genome SNP data. Following decades of work mapping Mendelian disease genes, this approach uses haplotypes of common alleles to measure very rare variation. This is an example of how one can take multiple approaches to existing high-density SNP array data, rather than needing to embark on a completely orthogonal data collection to execute a MRV-oriented test.

Standard association will be much more powerful when a single common causal variant is directly assayed or well captured by a tag SNP. When the CD/CV hypothesis does not hold, however, we hope that this approach will perform better. In this case, straightforward association approaches are unlikely to succeed, since the rare variants will most likely not be identified, genotyped, or tagged with sufficient precision; in any case, there will typically be too few observations to provide adequate statistical power for standard association tests of any one rare variant.

We are currently embarking on the next step—to determine the potential power of such an approach under a range of scenarios and to determine the best way to apply this method to real data. Possible extensions of this approach include allowing for LD between SNPs, genotyping error, and inbreeding in the IBD estimation. This approach could also be applied to detecting autozygous segments within a single individual, allowing for population-based homozygosity mapping to map recessive disease loci.

## Acknowledgments

We acknowledge support from the National Institutes of Health (NIH) National Heart, Lung, and Blood Institute ENDGAME project grant U01 HG004171 (to S.P., M.J.D., and P.I.W.d.B.), from NIH grant EY-12562 (to S.P. and P.C.S.), from The Research Grants Council of Hong Kong, Project Number HKU 7669/06M (to S.P. and P.C.S.), from The University of Hong Kong Strategic Research Theme on Genomics, Proteomics and Bioinformatics (to P.C.S.), from National Health and Medical Research Council of Australia Sidney Sax fellowship 389927 (to M.A.R.F.), and from NIH/National Institute of Mental Health grant R03 MH73806-01A1 (to S.P.). We also thank the NINDS Repository at Coriell for making the data from the Laboratory of Neurogenetics (part of the intramural program of the National Institute on Aging, NIH) available free of charge. These data were deposited by John Hardy and Andrew Singleton; we accessed the data (upload identification numbers 7 and 8) at the Queue portal at the Coriell Institute. Finally, we thank PLINK users, both within the Broad Institute Medical and Population Genetics Program and elsewhere, for all feedback.

## Appendix A: Description of HMM Emission and Transition Parameters, *M* and *T*

The elements of *M*—the probability of the pair’s genotypes for that SNP conditional on IBD state—*P*(*M*|*Z*)) are calculated according to table 2. These values are a function of allele frequency, including an ascertainment correction term, which follows the procedure described above in the calculation of global IBD probabilities. Markers with missing genotypes are assigned an identity matrix *M*.

The elements of *T*, the transition probabilities between two IBD states at neighboring loci, are precalculated in terms of the recombination fraction estimated from a specified genetic map and global relatedness for that pair, *P*(*Z*). With dense SNP maps, the method is not particularly sensitive to the precise genetic map used: in practice, a basic 1 *cM*=1 *Mb* approximation appears to work well, although one could also use the fine-scale recombination map.^{36} For each pair of individuals, we estimate the least number of meioses that separate the two genomes and use these estimates to specify a transition matrix for unobserved IBD states along the chromosome. Specifically, we consider two chromosomes, or haploid genomes (*C*_{1} and *C*_{2}), that share a common ancestor (*C*_{C}, a diploid genome), with *C*_{1} and *C*_{2} separated by *m* meioses. If *C*_{1} and *C*_{2} are present in distinct individuals, then *m*2, whereas, if *C*_{1} and *C*_{2} are in the same person, then *m*3. At a particular locus, *U,* the probability that *C*_{1} and *C*_{2} are IBD is (1/2)^{(m-1)}. That is, if we label the allele transmitted in the first meiosis from *C*_{C} as “*u*_{1},” there is probability (1/2)^{(m-1)} that all the other *m*-1 meioses also transmit *u*_{1}. Now, consider a second locus, *V,* that is linked to *U* with recombination fraction θ. Let the alleles at *V* present in *C*_{C} be *v*_{1} and *v*_{2}, with allele *v*_{1} in coupling phase with allele *u*_{1}. Figure A1 shows some examples of possible transmission patterns and the corresponding IBD states for *C*_{1} and *C*_{2}, which we outline here.

For *C*_{1} and *C*_{2} to be IBD at *V,* they must either both share allele *v*_{1} or allele *v*_{2}. For *C*_{1} and *C*_{2} to both share allele *v*_{1}, given that *C*_{1} and *C*_{2} are IBD at *U* for allele *u*_{1}, which is in coupling phase with *v*_{1} in the common ancestor *C*_{C}, all the *m* meioses must be nonrecombinants, so that *v*_{1} is cotransmitted with *u*_{1} all the way down to *C*_{1} and *C*_{2}. The probability of this is (1-θ)^{m}. For *C*_{1} and *C*_{2} to both share allele *v*_{2}, given that *C*_{1} and *C*_{2} are IBD at *U* for allele *u*_{1}, which is in repulsion phase with *v*_{2} in the common ancestor *C*_{C}, the two meioses of *C*_{C} must both be recombinants (so that *v*_{2} crosses over to be cotransmitted with *u*_{1}), and all the remaining *m*-2 meioses must be nonrecombinants, so that *v*_{2} is cotransmitted with *u*_{1} all the way down to *C*_{1} and to *C*_{2}. The probability of this is θ^{2}(1-θ)^{(m-2)}. If these two possibilities are taken together, the probability that *C*_{1} and *C*_{2} are IBD at *V,* given that they are IBD at *U,* is

which can be rewritten as (1-θ)^{(m-2)}[θ^{2}+(1-θ)^{2}]. The probability that *C*_{1} and *C*_{2} are IBD at *V,* given that they are not IBD at *U,* is given by the Bayes theorem:

which simplifies to

This equation involves two parameters, θ and *m.* Because of the assumption of very closely spaced markers, we use the Morgan map function θ=*d*, where *d* is the genetic distance between the loci in Morgans. For *m,* we consider the two pairs of haploid genomes (each pair containing a haploid genome from each individual) that may have common ancestry and estimate the numbers of meioses (*m*_{A} and *m*_{B}) that separate the two pairs. If *x*_{A} and *x*_{B} are the probabilities that the two pairs of haploid genomes are IBD, then

and

Substituting *x*_{B}=*P*(*Z*=2)/*x*_{A} into the second expression, we obtain

Solving this equation gives *x*_{A} and *x*_{B}. These IBD probabilities (*x*) are related to the number of meioses separating the haploid genomes (*m*) by *x*=(1/2)^{(m-1)}. Therefore, if *P*(*Z*=2)=0, then

and *m*_{B}=0; otherwise,

and

Note that the quadratic equation

has real roots only if

This gives rise to the inequality *P*(*Z*=2)π^{2} in the previous section on constraining global IBD estimates to biologically plausible values. The form of the transition matrix for pairs of haploid genomes (denoted *A* and *B*) is given in table A1; these are combined to form the full transition matrix for a diploid genome in table 3.

### Table A1.

Haploid IBD State | h^{(l+1)}_{0} | h^{(l+1)}_{1} |

h^{(l)}_{0} | ||

h^{(l)}_{1} | 1-(1-θ)^{m-2}ξ | (1-θ)^{m-2}ξ |

Note.— Probability of haploid IBD states 0 (*h*_{0}) and 1 (*h*_{1}) at locus *l*+1 conditional on state at locus *l*; *m* is the estimate of the least number of meioses for that haploid pair of genomes, and ξ=θ^{2}+(1-θ)^{2}.

## Web Resources

The URLs for data presented herein are as follows:

## References

**American Society of Human Genetics**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (661K)

- Inference of relationships in population data using identity-by-descent and identity-by-state.[PLoS Genet. 2011]
*Stevens EL, Heckenberg G, Roberson ED, Baugher JD, Downey TJ, Pevsner J.**PLoS Genet. 2011 Sep; 7(9):e1002287. Epub 2011 Sep 22.* - Genome-wide linkage analysis with clustered SNP markers.[J Biomol Screen. 2009]
*Selmer KK, Brandal K, Olstad OK, Birkenes B, Undlien DE, Egeland T.**J Biomol Screen. 2009 Jan; 14(1):92-6.* - High-resolution detection of identity by descent in unrelated individuals.[Am J Hum Genet. 2010]
*Browning SR, Browning BL.**Am J Hum Genet. 2010 Apr 9; 86(4):526-39. Epub 2010 Mar 18.* - Application of genome-wide SNP data for uncovering pairwise relationships and quantitative trait loci.[Genetica. 2009]
*Sham PC, Cherny SS, Purcell S.**Genetica. 2009 Jun; 136(2):237-43. Epub 2009 Jan 7.* - Applications of whole-genome high-density SNP genotyping.[Expert Rev Mol Diagn. 2005]
*Craig DW, Stephan DA.**Expert Rev Mol Diagn. 2005 Mar; 5(2):159-70.*

- Fine mapping of a QTL on chromosome 13 for submaximal exercise capacity training response: the HERITAGE Family Study[European journal of applied physiology. 201...]
*Rice TK, Sarzynski MA, Sung YJ, Argyropoulos G, Stütz AM, Teran-Garcia M, Rao DC, Bouchard C, Rankinen T.**European journal of applied physiology. 2012 Aug; 112(8)2969-2978* - fcGENE: A Versatile Tool for Processing and Transforming SNP Datasets[PLoS ONE. ]
*Roshyara NR, Scholz M.**PLoS ONE. 9(7)e97589* - Genetic variants in STAT4 and HLA-DQ genes confer risk of hepatitis B virus-related hepatocellular carcinoma[Nature genetics. 2013]
*Jiang DK, Sun J, Cao G, Liu Y, Lin D, Gao YZ, Ren WH, Long XD, Zhang H, Ma XP, Wang Z, Jiang W, Chen TY, Gao Y, Sun LD, Long JR, Huang HX, Wang D, Yu H, Zhang P, Tang LS, Peng B, Cai H, Liu TT, Zhou P, Liu F, Lin X, Tao S, Wan B, Sai-Yin HX, Qin LX, Yin J, Liu L, Wu C, Pei Y, Zhou YF, Zhai Y, Lu PX, Tan A, Zuo XB, Fan J, Chang J, Gu X, Wang NJ, Li Y, Liu YK, Zhai K, Zhang H, Hu Z, Liu J, Yi Q, Xiang Y, Shi R, Ding Q, Zheng W, Shu XO, Mo Z, Shugart YY, Zhang XJ, Zhou G, Shen H, Zheng SL, Xu J, Yu L.**Nature genetics. 2013 Jan; 45(1)72-75* - Reducing Pervasive False-Positive Identical-by-Descent Segments Detected by Large-Scale Pedigree Analysis[Molecular Biology and Evolution. 2014]
*Durand EY, Eriksson N, McLean CY.**Molecular Biology and Evolution. 2014 Aug; 31(8)2212-2222* - MHC associations with clinical and autoantibody manifestations in European SLE[Genes and immunity. 2014]
*Morris D, Fernando M, Taylor K, Chung S, Nititham J, Alarcón-Riquelme M, Barcellos L, Behrens T, Cotsapas C, Gaffney P, Graham R, Pons-Estel B, Gregersen P, Harley J, Hauser S, Hom G, Langefeld C, Noble J, Rioux J, Seldin M, Systemic Lupus Erythematosus Genetics Consortium, Vyse T, Criswell L.**Genes and immunity. 2014 Apr; 15(4)210-217*

- Cited in BooksCited in BooksPubMed Central articles cited in books
- MedGenMedGenRelated information in MedGen
- PubMedPubMedPubMed citations for these articles
- SNPSNPPMC to SNP links

- PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Anal...PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage AnalysesAmerican Journal of Human Genetics. Sep 2007; 81(3)559PMC

Your browsing activity is empty.

Activity recording is turned off.

See more...