Copyright © Springer Science+Business Media, LLC 2007

The polymorphism architecture of mouse genetic resources elucidated using genome-wide resequencing data: implications for QTL discovery and systems genetics

1 Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599 USA
2 Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599 USA
3 Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599 USA
4 Carolina Center for Genome Sciences, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599 USA
5 Bioinformatics and Computational Biology Training Program, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599 USA
6 Center for Environmental Health and Susceptibility and Center for Gastrointestinal Biology and Disease, University of North Carolina at Chapel Hill, Chapel Hill, 27599 USA
7 Department of Genetics, University of North Carolina at Chapel Hill, CB# 7264, 103 Mason Farm Road, Chapel Hill, NC 27599-7264 USA

David W. Threadgill, Email: dwt/at/med.unc.edu. Corresponding author.

Received May 3, 2007; Accepted June 11, 2007.

Abstract

Mouse genetic resources include inbred strains, recombinant inbred lines, chromosome substitution strains, heterogeneous stocks, and the Collaborative Cross (CC). These resources were generated through various breeding designs that potentially produce different genetic architectures, including the level of diversity represented, the spatial distribution of the variation, and the allele frequencies within the resource. By combining sequencing data for 16 inbred strains and the recorded history of related strains, the architecture of genetic variation in mouse resources was determined.
The most commonly used resources harbor only a fraction of the genetic diversity of Mus musculus, and that diversity is not uniformly distributed, resulting in many blind spots. Only resources that include wild-derived inbred strains from subspecies other than M. m. domesticus have no blind spots and a uniform distribution of the variation. Unlike other resources that are primarily suited for gene discovery, the CC is the only resource that can support genome-wide network analysis, which is the foundation of systems genetics. The CC captures significantly more genetic diversity with no blind spots and has a more uniform distribution of the variation than all other resources. Furthermore, the distribution of allele frequencies in the CC resembles that seen in natural populations such as humans, in which many variants are found at low frequencies and only a minority of variants are common. We conclude that the CC represents a dramatic improvement over existing genetic resources for mammalian systems biology applications.

Introduction

Since the derivation of the original inbred mouse strains from populations of fancy mice to investigate the genetic basis of cancer (reviewed in Paigen 2003), many additional inbred strains have been derived that harbor a tremendous amount of natural genetic variation (Beck et al. 2000; Ideraabdullah et al. 2004). However, unlike the more recently produced wild-derived strains, the vast majority of commonly used inbred strains trace their ancestry to the original mouse-fancier populations. An analysis of the genomes of extant inbred strains was recently made possible using data from a 15-strain resequencing project (http://www.mouse.perlegen.com/mouse/download.html), which revealed that the most widely used laboratory inbred strains are not random composites of the three main mouse subspecies (Mus musculus domesticus, M. m. musculus, and M. m. castaneus), but have a remarkably high level of shared ancestry largely contributed by the M. m.
domesticus subspecies (Yang et al. 2007). Since many of the original inbred strains are also the most widely used in biomedical and laboratory research, the architecture of the genetic variation in derived resources is highly dependent on the interconnected and complex breeding histories of the progenitor inbred strains (Lyon et al. 1996).

Over the last fifty years, numerous genetic resources have been devised and developed for specific purposes using a variety of inbred strains as progenitors (reviewed in Silver 1995). The major genetic resources that are currently in wide use include recombinant inbred (RI) lines (Bailey 1971; Broman 2005), recombinant congenic strains (RCS) (Demant and Hart 1986), genome-tagged or congenic (CON) lines (Iakoubova et al. 2001), chromosome substitution strains (CSS) (Hudgins et al. 1985; Nadeau et al. 2000), heterogeneous stocks (HS) (Hitzemann et al. 1994), and, more recently, Laboratory Strain Diversity Panels (LSDP) drawn from the Mouse Phenome Project (Paigen and Eppig 2000) for association studies. Although the major use conceptualized for RI lines was linkage analysis (Bailey 1971), with the expanded sizes of many RI panels they are now being used to support analysis of more complex polygenic traits (Markel et al. 1996; Williams et al. 2001). Similarly, CSS and LSDP resources are being used for the genetic analysis of polygenic traits. CSS have a simplified genetic structure with only one chromosome differing between a single CSS and the parental recipient strain, a characteristic not shared with the other resources (Nadeau et al. 2000). The HS are significantly different from RI lines or CSS in that they typically contain multiple inbred strain progenitors, which potentially increases the level of genetic diversity represented in the resource (Yalcin et al. 2005). The LSDP were recently envisioned to adapt many of the whole-genome association technologies being developed by the human genetics community (Grupe et al.
2001; Bogue and Grubb 2004; Liao et al. 2004; Pletcher et al. 2004; McClurg et al. 2007; Payseur and Place 2007). In theory, the LSDP should encompass large amounts of variation, but in practice, since analyses of LSDP resources have largely been limited to panels of classical inbred strains, the diversity is most likely restricted to M. m. domesticus.

Similar to LSDP resources, a more recently developed resource called the Collaborative Cross (CC) was designed to incorporate large amounts of variation (Threadgill et al. 2002; Churchill et al. 2004; Valdar et al. 2006). The CC is a mammalian genetic reference population that was designed to have controlled randomization of genetic factors, which is essential for causal inference. The CC was designed as a panel of recombinant inbred lines derived from eight parental inbred strains through a mating scheme that minimizes unpredictable genomic interactions between strains and optimizes the contribution from each parental strain. The selection of the parental strains was based upon historical breeding records and suspected relationships drawn from sparse maps of genetic variation.

Herein we sought to reanalyze the structure of genetic variation present in various mouse genetic resources using genome resequencing data (http://www.mouse.perlegen.com/mouse/download.html). We found that the vast majority of resources capture very small amounts of the existing variation and that the variation that is captured is not randomly distributed. Unlike other resources, the CC has a high level of variation capture that is normally distributed across the genome. This structure is similar to that found in humans and other randomly breeding mammalian species, showing that the CC is an ideal model for systems biology analyses.
Materials and methods

Genotype data

All genotype data used in this study were obtained from the National Institute of Environmental Health Sciences’ “Resequencing and SNP Discovery Project” (http://www.niehs.nih.gov/crg/cprc.htm). These data contain over 109 million genotypes that identified 8.3 million SNPs spanning the 19 autosomes, the sex chromosomes, and the mitochondrial genome (http://www.mouse.perlegen.com/mouse/download.html). The 15 resequenced strains include 11 classical inbred strains (129S1/SvImJ, A/J, AKR/J, BALB/cBy, C3H/HeJ, DBA/2J, FVB/NJ, NOD/LtJ, BTBR T+tf/J, KK/HlJ, and NZW/LacJ) and four wild-derived strains (WSB/EiJ, PWD/PhJ, CAST/EiJ, and MOLF/EiJ), representing the M. m. domesticus, M. m. musculus, and M. m. castaneus subspecies and M. m. molossinus, a subspecies that arose by natural hybridization between M. m. musculus and M. m. castaneus (http://www.jax.org). In addition, the genotypes of the fully sequenced and annotated C57BL/6J genome were used. Incomplete genotypes were imputed as described previously (Roberts et al. 2007).

Mouse genetic resources

One example was chosen from each of the five major types of resources based on widespread or potential use. In all cases the example represented the maximal amount of diversity captured among similar resources. The BXD, derived from C57BL/6J and DBA/2J by B. Taylor, L. Silver, and R. Williams, was chosen as the prototypical RI line panel because of its past and current popularity (Taylor 1978; Peirce et al. 2004). The representative chromosome substitution strain panel was B.P, generated by J. Forejt, which has PWD/Ph chromosomes introgressed into the C57BL/6J background. The Northport HS derived from A/J, AKR/J, BALB/cJ, CBA/J, C3H/HeJ, C57BL/6J, DBA/2J, and LP/J was used as the example of heterogeneous stock (Hitzemann et al. 1994).
The Collaborative Cross is an RI line panel produced from the eight parental inbred strains A/J, C57BL/6J, 129S1/SvImJ, NOD/LtJ, NZO/HlLtJ, CAST/EiJ, PWK/PhJ, and WSB/EiJ (Threadgill et al. 2002; Churchill et al. 2004). Finally, since the emergence of the Mouse Phenome Project (Paigen and Eppig 2000), several panels of inbred strains have been considered for association studies (Bogue and Grubb 2004; Liao et al. 2004; McClurg et al. 2007; Payseur and Place 2007). The LSDP described by Payseur and Place was used as a representative because it is composed only of classical inbred strains, including A/J, AKR/J, BALB/cByJ, BTBR T +tf/tf, BUB/BnJ, CBA/J, CE/J, C3H/HeJ, C57BL/6J, C57BLKS/J, C57L/J, C57BR/cdJ, C58/J, DBA/2J, FVB/NJ, I/LnJ, KK/HlJ, LP/J, MA/MyJ, NOD/LtJ, NON/LtJ, NZB/BlNJ, NZW/LacJ, PL/J, RIIIS/J, SEA/GnJ, SJL/J, SM/J, SWR/J, and 129S1/SvImJ. Other inbred panels that also include wild-derived strains have not been useful for association mapping because of the large number of private polymorphisms contributed by strains derived from other subspecies.

Strain substitutions

Estimates of the polymorphism diversity captured by each resource represent best-case scenarios since they assume all diversity present in the parental strains is captured by the derived resources. Genetic diversity can be estimated directly in the BXD RI and the B.P CSS because the parental strains have been sequenced. In the remaining resources it was necessary to substitute sequenced strains for those that have not been sequenced. These substitutions were based on genetic similarity estimated using genotypes at SNPs distributed along the entire genome (Petkov et al. 2004). Five of the parental strains in the Northport HS have been sequenced: A/J, AKR/J, C3H/HeJ, C57BL/6J, and DBA/2J.
The remaining three strains were substituted by a sister substrain (BALB/cJ was substituted by BALB/cBy), a related strain (LP/J was substituted by BTBR T+tf/J), or a Castle strain that will overestimate the diversity present in this panel (CBA/J was substituted by NZW/LacJ). Six of the parental strains in the Collaborative Cross have been sequenced: 129S1/SvImJ, A/J, C57BL/6J, NOD/LtJ, WSB/EiJ, and CAST/EiJ. The remaining two strains were substituted by strains from similar origins (NZO/HlLtJ was substituted by NZW/LacJ and PWK/PhJ was substituted by PWD/PhJ). Finally, for the LSDP we used all 12 classical inbred strains plus WSB/EiJ. Although the number of strains used for our analyses is significantly lower than in the original panel (Payseur and Place 2007), the WSB/EiJ strain is a larger contributor to the diversity than any single classical strain or group of classical inbred strains combined (Yang et al. 2007), suggesting that this will be an accurate representation of existing panels.

Results

The genetic diversity captured in the major mouse genetic resources depends on the number and identity of parental strains involved in their derivation, as well as the breeding design used to generate the resource (Fig. 1

Diversity captured is a function of the number of parental strains

Most resources used in genetic studies are derived from crosses involving two parental strains or multiples thereof in order to introduce equivalent variation from each parental strain. Therefore, we used the mouse genome resequencing data to determine the range (maximum, minimum, and average) of diversity captured in any theoretical resource involving any 2, 4, 8, and 16 parental strains (Fig. 2
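
The range calculation described in this subsection can be illustrated with a short sketch. This is not the authors' code: it assumes a hypothetical in-memory genotype table with biallelic SNPs coded 0/1, counts a SNP as captured when the chosen strains carry both alleles, and enumerates every subset of a given size. The strain names, the toy data, and the function name `diversity_range` are invented for illustration.

```python
from itertools import combinations


def diversity_range(genotypes, k):
    """Return (min, max, mean) fraction of SNPs segregating among k strains.

    genotypes: dict mapping strain name -> list of 0/1 allele calls, one per
    SNP (hypothetical encoding; every list must have the same length).
    """
    strains = sorted(genotypes)
    n_snps = len(genotypes[strains[0]])
    fractions = []
    for subset in combinations(strains, k):
        # A SNP is captured if the subset carries more than one allele.
        captured = sum(
            1 for i in range(n_snps)
            if len({genotypes[s][i] for s in subset}) > 1
        )
        fractions.append(captured / n_snps)
    return min(fractions), max(fractions), sum(fractions) / len(fractions)


# Toy data: three closely related strains and one divergent strain.
toy = {
    "strainA": [0, 0, 0, 0],
    "strainB": [0, 1, 0, 0],
    "strainC": [0, 1, 1, 0],
    "wildD": [1, 0, 1, 1],
}
lowest, highest, average = diversity_range(toy, 2)
```

In the toy data the pairs that include the divergent strain capture the most SNPs, which mirrors the qualitative point of the text. Exhaustive enumeration is practical for 16 strains (at most 12,870 subsets, reached at size 8), although evaluating each subset over millions of SNPs would call for a vectorized representation in practice.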

Diversity captured is a function of the subspecific origin of the parental strains

A recent analysis of the mouse genome resequencing data demonstrates that over 92% of the genome of classical inbred strains is derived from the M. m. domesticus subspecies, and, unexpectedly, approximately 75% of the genome of MOLF/EiJ is of M. m. musculus origin (Yang et al. 2007). Based on these observations, it is possible to assign each of the 16 sequenced strains to a major subspecies (see Fig. 1

Spatial distribution of the diversity varies significantly among resources

In addition to the total diversity captured, it is critical to consider how the variation captured in each resource is distributed across the genome. When such analyses are performed (Fig. 4
When the distribution of the variation captured is plotted in consecutive high-resolution intervals (Fig. 5
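
One way to picture the interval-based analysis is the sketch below. It is an illustration under stated assumptions, not the method used in the paper: SNP positions are taken to be 0-based coordinates smaller than the chromosome length, each SNP carries a flag saying whether it segregates in the resource, and both the window size and the blind-spot threshold are arbitrary placeholders because the interval size is not given in this text. The function names are hypothetical.

```python
def window_capture(positions, segregating, window_size, chrom_length):
    """Fraction of known SNPs per consecutive window that segregate.

    positions: 0-based SNP coordinates (bp), each < chrom_length.
    segregating: parallel list of booleans for one genetic resource.
    Returns one value per window; None marks windows without any SNP.
    """
    n_windows = -(-chrom_length // window_size)  # ceiling division
    totals = [0] * n_windows
    hits = [0] * n_windows
    for pos, seg in zip(positions, segregating):
        w = pos // window_size
        totals[w] += 1
        hits[w] += int(seg)
    return [h / t if t else None for h, t in zip(hits, totals)]


def blind_spots(fractions, threshold):
    """Indices of windows whose captured fraction falls below threshold."""
    return [i for i, f in enumerate(fractions) if f is not None and f < threshold]


# Toy chromosome of 300 bp split into three 100-bp windows.
positions = [10, 40, 120, 150, 180, 260]
segregating = [True, False, False, False, False, True]
fractions = window_capture(positions, segregating, 100, 300)
low_windows = blind_spots(fractions, 0.1)
```

Here the middle window contains known SNPs but none of them segregate, so it is reported as a blind spot; windows with no known SNPs are kept separate (`None`) so that missing data is not mistaken for missing variation.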

Allele frequency of the variation captured

In addition to the level and distribution of the variation captured, the frequency of the minor alleles can impact the utility of a particular genetic resource. Therefore, to compare this characteristic among the different genetic resources, we determined the allele frequency present in the 8.3 million SNPs reported for the mouse genome resequencing project in the different resources considered in this study (Fig. 6
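
A minimal sketch of a minor-allele tally follows. It assumes, as a simplification that the text does not state, that every parental strain contributes equally to the resource, so the minor allele frequency at a SNP is the minor allele count divided by the number of parental strains. The 0/1 genotype encoding, the eight placeholder strain names, and the function names are hypothetical.

```python
from collections import Counter


def minor_allele_counts(genotypes, strains):
    """Minor allele count at each SNP among the given parental strains.

    genotypes: dict mapping strain name -> list of 0/1 allele calls.
    Dividing a count by len(strains) gives the minor allele frequency,
    assuming equal contribution from every parental strain.
    """
    n_snps = len(genotypes[strains[0]])
    counts = []
    for i in range(n_snps):
        alt = sum(genotypes[s][i] for s in strains)
        counts.append(min(alt, len(strains) - alt))
    return counts


def frequency_spectrum(counts):
    """Number of segregating SNPs at each minor allele count (zero excluded)."""
    return Counter(c for c in counts if c > 0)


# Toy panel of eight parental strains and five SNPs.
toy = {
    "s1": [0, 0, 0, 1, 0],
    "s2": [0, 0, 0, 1, 0],
    "s3": [0, 0, 1, 1, 0],
    "s4": [0, 0, 1, 1, 0],
    "s5": [0, 0, 1, 1, 0],
    "s6": [0, 1, 1, 1, 0],
    "s7": [0, 0, 0, 1, 0],
    "s8": [1, 1, 1, 0, 0],
}
counts = minor_allele_counts(toy, sorted(toy))
spectrum = frequency_spectrum(counts)
```

Working with integer counts rather than floating-point frequencies keeps the spectrum bins exact. In a two-parent resource every segregating SNP necessarily has a minor allele frequency of 0.5 under this model, whereas an eight-parent panel can also hold variants at 1/8, 2/8, and 3/8.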

Discussion

The recent explosion in genetic variation data for mice made possible by the resequencing of 15 mouse inbred strains (http://www.niehs.nih.gov/crg/cprc.htm) allows us to accurately determine and compare the polymorphic architecture of different mouse genetic resources. The most widely used resources suffer from very low rates of polymorphism capture (all extant RI lines, RCS, and the B.A and B.129 CSS) or medium levels of polymorphism capture that is nonuniformly distributed (B.P CSS, B6.CAST CON, Northport and Boulder HS, and the LSDP). Although the proportion of the genome being interrogated with these resources does not limit their use for discovering subsets of functional gene variants controlling specific phenotypes, it greatly impairs their utility for genome-wide systems biological analyses. In addition, differences in allele frequency among the resources impact the relative allele strength that can be detected, with a consequential effect on the number of functional gene variants that can be detected by a particular resource. The common ancestry, dominated by M. m. domesticus, of many of the strains that have contributed to most mouse genetic resources has resulted in a dramatic reduction in the pool of available gene variants for genome-wide discovery and, more importantly, may complicate their use for systems-level analyses of mammalian biology, which depend on high levels of uniformly distributed genetic variation.

The CC represents a resource that has optimal polymorphism architecture for systems biology applications. In particular, the uniform distribution of the high level of variation captured is ideal to support global analysis of complex biological systems, which is most efficiently achieved using experimental designs that employ multifactorial perturbations (Fisher 1935).
Although the allele frequency distribution in the CC is not necessarily the best to detect the effects of any particular polymorphism, it is representative of natural populations and should outperform all other resources for trait correlation analysis, which is the foundation of systems genetics, and all but the resources with only two parental strains in detection of specific gene functional variants. However, the resources with only two parental strains capture much lower levels of available polymorphisms, and the captured polymorphisms are not uniformly distributed, greatly reducing their genome-wide utility for systems biology applications.

With the shift in complex trait gene discovery to humans that has been made possible by affordable high-density genotyping of large numbers of phenotyped individuals, the mouse will be taking a new role in biological research, that of a model to support mammalian systems biology investigations. Our analyses demonstrate that the CC represents a dramatic improvement over other genetic resources since it is the only resource that can serve this role based on the level, distribution, and allele frequency of captured polymorphisms. The overall performance of the CC is particularly remarkable given that the original choice of parental strains represented a compromise between the practical desire to take advantage of existing resources such as genome sequence, mapping panels, and ES cell lines and the ultimate goal of maximizing diversity (Churchill et al. 2004).

Acknowledgments

This work resulted from a collaborative effort by members of the UNC Computational Genetics Workgroup and was supported in part by the National Institute of General Medical Sciences as part of the Center of Excellence in Systems Biology (1P50 GM076468 to FPMV), the National Science Foundation (IIS 0448392 to WW), the Environmental Protection Agency (STAR RD832720 to WW), the Barry M.
Goldwater Scholarship to AR, and the National Cancer Institute (1U01 CA105417 to DWT). Center support from the National Cancer Institute (5P30 CA016086), the National Institute of Environmental Health Sciences (2P30 ES010126), and the National Institute of Diabetes and Digestive and Kidney Diseases (5P30 DK034987) supported the collaborative environment.