• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of pnasPNASInfo for AuthorsSubscriptionsAboutThis Article
Proc Natl Acad Sci U S A. Aug 19, 2003; 100(17): 9900–9905.
Published online Aug 4, 2003. doi:  10.1073/pnas.1633613100
PMCID: PMC187880
Genetics

Minimal haplotype tagging

Paola Sebastiani, Ross Lazarus,§ Scott T. Weiss,§ Louis M. Kunkel,||†† Isaac S. Kohane,‡‡ and Marco F. Ramoni‡‡§§

Abstract

The high frequency of single-nucleotide polymorphisms (SNPs) in the human genome presents an unparalleled opportunity to track down the genetic basis of common diseases. At the same time, the sheer number of SNPs also makes unfeasible genomewide disease association studies. The haplotypic nature of the human genome, however, lends itself to the selection of a parsimonious set of SNPs, called haplotype tagging SNPs (htSNPs), able to distinguish the haplotypic variations in a population. Current approaches rely on statistical analysis of transmission rates to identify htSNPs. In contrast to these approximate methods, this contribution describes an exact, analytical, and lossless method, called BEST (Best Enumeration of SNP Tags), able to identify the minimum set of SNPs tagging an arbitrary set of haplotypes from either pedigree or independent samples. Our results confirm that a small proportion of SNPs is sufficient to capture the haplotypic variations in a population and that this proportion decreases exponentially as the haplotype length increases. We used BEST to tag the haplotypes of 105 genes in an African-American and a European-American sample. An interesting finding of this analysis is that the vast majority (95%) of the htSNPs in the European-American sample is a subset of the htSNPs of the African-American sample. This result seems to provide further evidence that a severe bottleneck occurred during the founding of Europe and the conjectured “Out of Africa” event.

Keywords: single-nucleotide polymorphisms, association studies

Single-nucleotide polymorphisms (SNPs) are an invaluable tool to uncover the genetic basis of common diseases (1, 2) by providing a high-resolution map of the genome and allowing researchers to associate variations in a particular genomic region to observable traits. Unfortunately, the sheer number of SNPs in the human genome, which makes SNPs so useful as markers, also makes genomewide association studies unfeasible. However, the number of distinct combinations of SNP alleles (haplotypes) encountered in human samples is a small fraction of the possible haplotypes that would arise if alleles were distributed randomly. This haplotypic structure of the genome lends itself to the selection of a parsimonious set of SNPs, called haplotype tagging SNPs (htSNPs), able to distinguish the haplotypic variations in a population.

Given a set of haplotypes in a genomic region, identified through statistical (3, 4) or molecular (5, 6) methods, the process of haplotype tagging is in principle deterministic. Unfortunately, this problem is also computationally intractable (7), because its solution requires the testing of every possible combination of SNPs in the haplotype set, and the number of these combinations grows exponentially with the number of SNPs in the haplotype set. Current approaches rely on approximate methods to identify htSNPs. Most efforts (812) have focused on the identification of a secondary haplotype structure across several large regions of the genome. This substructure comprises regions of limited recombination, called haplotype blocks, bounded by small regions characterized by higher recombination rates. Within these smaller regions, htSNPs can be readily identified by eye or by brute-force search. An alternative approach (13) searches for htSNPs by maximizing the haplotype diversity “explained” by a set of SNPs. Using this method, Johnson et al. (13) were able to identify htSNPs accounting for up to 80% of the genomic variations in the populations they analyzed. Despite their differences, both block-based and direct approaches use stochastic methods to identify a reduced set of SNPs sufficient to characterize a genomic region in a population. A common concern about these approaches is that the loss of information induced by their stochastic nature could lead to overlooking rare variations responsible for less frequent diseases (14).

In contrast to these approximate approaches, we introduce the first exact, analytical, lossless solution to the problem of identifying the minimum set of SNPs accounting for the variations in an arbitrary genomic region. This method, called Best Enumeration of SNP Tags (BEST), does not follow a suboptimal heuristic or some approximate, stochastic approach but takes advantage of a peculiar aspect of the genome (the relatively small number of haplotypes with respect to the number of SNPs) to confine the source of complexity to a smaller search space. In this way, the reliability of the identified htSNPs will be only a function of the inferred haplotypes, and the haplotype tagging process will not induce any further information loss. Experimental results show that BEST runs to completion in a matter of seconds even for genomic regions containing >200 SNPs.

The method described in this contribution can take as input haplotypes inferred from cross-sectional samples via stochastic systems (3, 4) or from pedigree data. Therefore, we applied our method to both a set of 105 genes from 47 independent subjects and haplotypes for 9 genes from pedigree data described in ref. 13. Our results confirm that a small proportion of SNPs is sufficient to capture the haplotypic variations in a population and show that this proportion decreases exponentially as the number of SNPs in the haplotype increases. Comparing BEST to the method proposed by Johnson et al. (13), we also show that, in two genes of nine, our method finds smaller sets of htSNPs, suggesting that BEST improves their original results, even for comparatively small haplotypes.

Materials

Data Collection. SNP genotype data for 24 self-described African Americans (12 female) and 23 European Americans (11 female) were obtained for 105 genes: 85 genes from the University of Washington–Fred Hutchinson Cancer Research Center Variation Discovery Resource Program for Genomic Applications (http://pga.mbt.washington.edu) and 20 genes from the Innate Immunity Programs for Genomic Applications (http://innateimmunity.net). All sequencing was performed on the same anonymized DNA samples from the Coriell Cell Repositories (http://locus.umdnj.edu/nigms), using the same Big Dye Terminator sequencing chemistry and equipment (Applied Biosystems). Both sites used the same software and virtually identical protocols for base calling, assembly, and SNP determination, as detailed on each of the respective web sites.

We also tested our method by using published haplotypes, which included 5 genes in a maximum of 418 multiplex families from the Diabetes UK Warren 1 Repository, 3 genes in 598 subjects from Finnish families with at least one sibling diagnosed with type 1 diabetes, and one gene from United Kingdom blood donors (13).

Haplotype Identification. Haplotypes comprising all SNPs with minor allele frequency of ≥10% were inferred for each gene, independently in each ethnic sample, by using default settings of the phase program (3) for each of the 105 genes. To account for the inherent error rate of current genotyping technologies, we selected only those haplotypes seen more than once (frequency of >4%).

Methods

Preliminary Definitions. We regard a SNP as a variable bearing, at most, four states, one for each of the possible alleles (A, T, G, and C), although, in practice, SNPs with more than two states are relatively uncommon. This variable can also encode insertion (+) and deletion (–) polymorphisms. An allele is the assignment of a value to a SNP: we will say that, in a particular individual, the SNP in a particular locus bears the value. A haplotype is a set of contiguous alleles, as they appear in the population of interest. A haplotype set is a set of contiguous SNPs, and it is identified by a set of haplotypes found in the population of interest. Fig. 1 shows the unique haplotypes from SNPs with ≥10% rare allele frequency found in gene TLR7 from an African-American sample; columns represent SNPs, rows represent haplotypes, cells represent alleles, and the entire table is the haplotype set.

Fig. 1.
The haplotype set of gene TLR7 in an African-American sample. Each column represents a SNP, and each row represents a haplotype identified in the sample. In this case, there are 14 haplotypes spanning 59 SNPs. Color coding for each SNP is performed ...

A haplotype set can be a haplotype block, an entire gene, or an arbitrary genomic region. A SNP is derivable from a set of SNPs if its alleles are uniquely identified by a combination of alleles of the SNPs in such a set. A set of SNPs is sufficient to derive the SNPs in a haplotype set if all the SNPs in the haplotype set can be derived from the SNPs in such set. A set of SNPs is necessary to derive the SNPs in a haplotype set if, when one of its members is removed, at least one SNP in the haplotype set, including itself, is no longer derivable. A tagging set is the set of SNPs in a haplotype set necessary and sufficient to derive all of the SNPs in the set. Among all of the tagging sets, a minimal tagging set is a set containing the minimum number of SNPs, and its members will be called htSNPs. For a given haplotype set, there may be more than one minimal tagging set. Our goal is to find at least one of them.

Haplotype Tagging. Fig. 2 gives a skeletal description of the haplotype tagging algorithm BEST. The algorithm takes as input a set S of haplotypes, each representing a unique set of values of each SNP in the haplotype set, such as the haplotype set displayed in Fig. 1. The algorithm returns a minimal set of SNPs from which all of the other SNPs in the haplotype set can be derived. A preliminary step of the algorithm is to convert the haplotype set into binary form. The colors in Fig. 1 encode the binary conversion of a haplotype set. Although we consider here only the case of biallelic SNPs, this encoding can be easily generalized to triallelic SNPs and is currently implemented as such. When two SNPs share the same binary representation, they are termed binary equivalent. Any tagging set including a SNP will be equivalent to a tagging set where it is replaced by one of its binary equivalent SNPs. After the binary conversion, the algorithm will keep only one member of each group of binary equivalent SNPs.

Fig. 2.
Skeletal description of the BEST algorithm. A lowercase letter denotes a SNP, an uppercase letter denotes a set of SNPs, and a calligraphic uppercase letter denotes a set of sets. The symbol \ denotes the set-theoretic operation of subtraction, | ...

A fundamental property of this binary representation is that if a SNP is derivable from a set of other SNPs, then it will be derivable from any superset of such a set. This property spares the exponential effort of identifying the set of SNPs deriving a SNP. The second critical property exploited by BEST is that the tagging set identified by adding, at each step, the SNP that derives the maximum number of SNPs leads to SNP sets containing the minimal tagging set.

Let S = {s1,..., sm} denote the m SNPs in a haplotype set. BEST recursively partitions the set S in two groups H and D such that S = HD, where H is the minimal tagging set (the smallest set of SNPs necessary and sufficient to derive all of the SNPs in the haplotype set), D is the set of (mk) SNPs that are derivable from H, and a SNP dj is derivable from H if the value of dj in each haplotype can be expressed as a Boolean function f(.) of elements of H. A property of these Boolean functions is that if a SNP dj is derivable from a subset H′ of H, then dj is also derivable from any subset of H containing H′. This property follows from the fact that if a SNP is derivable from set of SNPs H′, then its alleles are uniquely identified by the allele combinations of the SNPs in H. Therefore, f(h1,..., hk) = f(h1,..., hk, hk+1, hk+2,...) for all h1,..., hk in H′. In this way, one can check whether a SNP sj is derivable from a subset of SNPs in S without necessarily knowing the specific subset, therefore avoiding the exponential cost of the search. In the skeletal description of the algorithm in Fig. 2, this operation is performed by the function DERIVABLE(x, Y), which returns true if the combinations of the alleles of a subset of the SNPs in the set Y uniquely identify the alleles of the SNP x. The function DERIVED(Y) returns the set of SNPs derivable from the set of SNPs Y. We can show that the cost of the function DERIVED(Y) is polynomial by noting that if a SNP x is derivable from a set of SNPs Y, it is also derivable by any superset of Y. Hence, to check the derivability of a SNP x from a set Y, we just check whether the alleles of at least one member of Y match the alleles of x. The function DERIVED(Y) is a simple iteration of the function DERIVABLE(x,Y) across the SNP set Y and can be executed in polynomial time.

The first step of the algorithm is the generation of the set H1 of htSNPs that are not derivable from any subset of S. This set is identified by examining the Boolean dependency of each sj on the set of SNPs S \ sj, and each sj that is not derivable is assigned to H1. The elements of H1 determine a partition of the remaining SNPs into equation M1. The set D1 contains the SNPs derivable from H1, whereas C1 is the set of SNPs that are not derivable from H1 and are therefore candidate htSNPs. Next, an augmentation procedure is applied to move one or more elements from C1 to H1 and from C1 to D1. First, the elements of C1 are sorted according to this criterion: equation M2 if the set D2i of SNPs derivable from equation M3 has cardinality greater than the cardinality of the set D2j of SNPs derivable from equation M4. If the criterion identifies only one c1 such that equation M5 for all other cj in C1, then equation M6, D2 = D21, and equation M7, and the procedure is repeated on the set C2 until the set of candidate SNPs is empty. When more than one set D2i of SNPs derivable from equation M8 has the same size of D21, then parallel partitions equation M9, D2i and equation M10 D2i) are generated, and the augmentation procedure is repeated on each of them. When none of the SNPs in C1 augments the set D1, then pairs of SNPs are treated as one single variable by the augmentation procedure. When the set of candidate SNPs is empty, if the augmentation procedure returns one or more necessary and sufficient sets H of htSNPs, then the algorithm stops. If no such set is found, the whole procedure is repeated on each of the sets H.

Proof of Optimality. We prove by induction on k, the minimum number of htSNPs, that the smallest set of necessary and sufficient htSNPs returned by BEST is the minimum set of htSNPs tagging the haplotype set at hand. Suppose first that k = 1, so that the minimum set of htSNPs consists only of one SNP. Because k = 1, then all elements of S are binary equivalent and BEST will return one of the m minimal tagging sets, each set given by {si}, for i = 1,..., m. Any of this set will be the minimum solution. Next, suppose the result is true for any minimum set of size k – 1 (that is, if the minimum set of htSNPs has size jk – 1, then we assume that BEST returns one of the minimum solutions of size j) and we show that the result is true when the minimum set consists of k htSNPs. More precisely, we assume that H = {h1,..., hn} is the set returned by BEST ordered according to the augmentation procedure, and we show that n = k. Suppose that we can decompose the set of SNPs S into equation M11, where Sk1 is the subset of S that is decomposed into equation M12, Dk1, Hk1 consists of the first k – 1 htSNPs in H, Dk1 is the set of SNPs that are derivable from Hk1, and Sk is the set of SNPs in S that are not derivable from Hk1. By induction, Hk1 is the minimum set of htSNPs for Sk1, and it is equivalent to any minimum set of htSNPs for Sk1. Furthermore, because we are assuming there is a minimum solution of size k, there exists at least one htSNP equation M13 in Sk that, once added to Hk1, makes of equation M14 the minimum set of htSNPs. Because the set of SNPs derivable from equation M15 is the whole set equation M16, then the cardinality of the set of SNPs derivable from equation M17 is the largest, and therefore equation M18 is the set of htSNPs found by BEST and n = k. If a decomposition equation M19 does not exist, then we can decompose equation M20, where equation M21 and Hki consists of the first (ki) SNPs, and repeat the same argument.

Results

BEST was first used to tag the 105 genes, ranging from 5 to 229 SNPs in length, described in Materials. The results of this analysis are summarized in Table 1. These results confirm that a small proportion of SNPs (14% for African Americans and 10% for European Americans) is sufficient to capture the variations in a haplotype set. It is interesting to note that the proportion of SNPs required to tag a gene decreases exponentially as the number of constituent SNPs increases. Fig. 3 shows the sharp exponential decay of the number of htSNPs as the size of the haplotype increases in both populations. This decay is due to the fact that, as the total number of SNPs increases, the observed haplotypes are likely to be a smaller ratio of the entire sample space, which grows exponentially with the number of SNPs. The algorithm takes advantage of the limited haplotype diversity in the genome to achieve an exponential saving in genotyping as the haplotype length increases.

Fig. 3.
Plots of the total number of SNPs in each gene against the ratio of SNPs required to tag it in the African-American (Left) and European-American (Right) samples. The steeper decay in the European-American sample is due to an increased number of binary ...
Table 1.
Results of the analysis of 105 genes using BEST

An interesting finding is that the majority of the htSNPs in the European-American sample also appear as htSNPs in the African-American sample. Because a minimal tagging set is not necessarily unique, these proportions can only be taken as a lower bound of the shared htSNPs. Still, an average of 95% (and in most cases 100%) of the htSNPs in the European-American sample are a subset of the htSNPs of the African-American sample. For 98 genes (94%), all of the htSNPs found in the European-American sample were also found in the African-American sample. This finding is strikingly identical across the vast majority of the 105 genes here considered and suggests that the lower variability of the European-American population is indeed the result of a depletion of an original gene pool, consistent with a severe bottleneck occurring during the founding of Europe and the proposed “Out of Africa” event (11, 15, 16).

An important result is that BEST successfully identifies the minimum set of htSNPs even in haplotype sets that would be unfeasible to tackle by exhaustive enumeration or prone to error by eye. We also analyzed nine genes described by Johnson et al. (13). Fig. 4 Left shows the results obtained by BEST and compares them with the results published in the original report. For two genes of nine, CASP10 and SDF1, BEST found multiple alternative htSNP sets smaller than those reported in the original report. For SDF1, the largest of the genes described in the original report spanning 22 SNPs, Johnson et al. determine that five htSNPs are required to tag haplotypes with frequency of >5%. In contrast, BEST reveals 10 equivalent sets of four htSNPs for their data. Even for the smaller gene, CASP10, with just 11 SNPs, BEST was able to identify four alternative sets of three htSNPs, against the single set of four htSNPs in the original report. These results suggest that, although their method is able to find sets of tagging SNPs, these are not necessarily optimal. In all genes, BEST identified equivalent SNPs. In our experience, knowing alternative equivalent htSNP sets is often valuable in practice when individual htSNPs prove difficult to genotype because of flanking repeat regions.

Fig. 4.
Results of the analysis of the data described by Johnson et al. (13). (Left) Summary of the results and a comparison with the results obtained in the original report. For each gene, the table reports the name, the number of SNPs, the number of haplotypes ...

Discussion

Haplotype-based studies are today considered one of the most promising approaches to discover the genetic basis of common diseases. One consequence of the haplotypic nature of the human genome is that only a subset of the SNPs in a haplotype will be sufficient to unambiguously distinguish the haplotypes. This feature of the genome promises to significantly reduce the number of SNPs required to completely genotype a sample and, in so doing, render feasible genomewide association studies. The identification of haplotype blocks created by the evolutionary history of the genome is an important step toward the identification of redundant SNPs, but the fulfillment of the promise of haplotype-based studies rests on the possibility of identifying which SNPs are actually able to tag a haplotype set with no information loss. This contribution described a feasible, exact, and lossless method able to identify such htSNPs and analytically tag an arbitrary stretch of the genome.

Current approaches focus on the identification of htSNPs based on linkage disequilibrium and on stochastic measures of haplotype diversity. Although these efforts provide useful insight into the natural history of the genome, we have shown that analytical haplotype tagging of arbitrary genomic regions is more efficient at identifying parsimonious sets of htSNPs. We believe that the ability of our method to identify the minimum tagging set for an arbitrary region of the genome can be instrumental in delivering on the promise of haplotype-based studies. Furthermore, the ability of our method to identify alternative minimal sets of htSNPs, when available, can be valuable in practice when htSNPs prove difficult to genotype. Coupling BEST with a map of human haplotypes would provide investigators with a powerful tool to design association studies.

A computer program implementing the method described here is available at http://genomethods.org/best.

Acknowledgments

We thank Emanuela Gussoni (Harvard Medical School), Stefano Monti (Massachusetts Institute of Technology/Whitehead Institute, Cambridge, MA), Alberto Riva (Harvard Medical School), and the referees for their insightful comments. This research was supported, in part, by National Science Foundation Grant ECS-0120309 (to M.F.R. and P.S.) and National Institutes of Health Grants HL-66795 (to S.T.W. and I.S.K.) and P01 NS40828 (to L.M.K. and I.S.K.). L.M.K. is an investigator of the Howard Hughes Medical Institute.

Notes

Abbreviations: BEST, Best Enumeration of SNP Tags; SNP, single-nucleotide polymorphism; htSNP, haplotype tagging SNP.

References

1. Lander, E. S. (1996) Science 274, 536–539. [PubMed]
2. Collins, F. S., Guyer, M. S. & Chakravarti, A. (1997) Science 278, 1580–1581. [PubMed]
3. Stephens, M., Smith, N. & Donnelly, P. (2001) Am. J. Hum. Genet. 68, 978–989. [PMC free article] [PubMed]
4. Niu, T., Qin, Z., Xu, X. & Liu, J. (2002) Am. J. Hum. Genet. 70, 157–169. [PMC free article] [PubMed]
5. Woolley, A. T., Guillemette, C., Cheung, C. L., Housman, D. E. & Lieber, C. M. (2000) Nat. Biotechnol. 18, 760–763. [PubMed]
6. Glatt, C. E., DeYoung, J. A., Delgado, S., Service, S. K., Giacomini, K. M., Edwards, R. H., Risch, N. & Freimer, N. B. (2001) Nat. Genet. 27, 435–438. [PubMed]
7. Garey, M. R. & Johnson, D. S. (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness (Freeman, New York).
8. Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., et al. (2001) Science 294, 1719–1723. [PubMed]
9. Rioux, J. D., Daly, M. J., Silverberg, M. S., Lindblad, K., Steinhart, H., Cohen, Z., Delmonte, T., Kocher, K., Miller, K., Guschwan, S., et al. (2001) Nat. Genet. 29, 223–228. [PubMed]
10. Daly, M. J., Rioux, J. D., Schaffner, S. F., Hudson, T. J. & Lander, E. S. (2001) Nat. Genet. 29, 229–232. [PubMed]
11. Reich, D. E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P. C., Richter, D. J., Lavery, T., Kouyoumjian, R., Farhadian, S. F., Ward, R. & Lander, E. S. (2001) Nature 411, 199–204. [PubMed]
12. Gabriel, S. B., Schaffner, S. F., Nguyen, H., Moore, J. M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. (2002) Science 296, 2225–2229. [PubMed]
13. Johnson, G. C., Esposito, L., Barratt, B. J., Smith, A. N., Heward, J., Genova, G. D., Ueda, H., Cordell, H. J., Eaves, I. A., Dudbridge, F., et al. (2001) Nat. Genet. 29, 233–237. [PubMed]
14. Casci, T. (2002) Nat. Rev. Genet. 3, 573.
15. Reich, D. & Goldstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 8119–8123. [PMC free article] [PubMed]
16. Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408, 708–713. [PubMed]

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

  • MedGen
    MedGen
    Related information in MedGen
  • PubMed
    PubMed
    PubMed citations for these articles

Recent Activity

  • Minimal haplotype tagging
    Minimal haplotype tagging
    Proceedings of the National Academy of Sciences of the United States of America. Aug 19, 2003; 100(17)9900
    PMC

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...