NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
Mamm Genome. Author manuscript; available in PMC Aug 12, 2009.
Published in final edited form as:
PMCID: PMC2725522
NIHMSID: NIHMS98745

An imputed genotype resource for the laboratory mouse

Abstract

We have created a high-density SNP resource encompassing 7.87 million polymorphic loci across 49 inbred strains of the laboratory mouse by combining data available from public databases and training a hidden Markov model to impute missing genotypes in the combined data. The strong linkage disequilibrium found in dense sets of SNP markers in the laboratory mouse provides the basis for accurate imputation. Using genotypes from eight independent SNP resources, we empirically validated the quality of the imputed genotypes and demonstrated that they are highly reliable for most inbred strains. The imputed SNP resource will be useful for studies of natural variation and complex traits. It will facilitate association study designs by providing high density SNP genotypes for large numbers of mouse strains. We anticipate that this resource will continue to evolve as new genotype data become available for laboratory mouse strains. The data are available for bulk download or query at http://cgd.jax.org/.

Keywords: mouse, SNP, hidden Markov model, missing data

INTRODUCTION

The laboratory mouse owes much of its popularity as a model organism in biomedical research to the existence of a large collection of inbred strains that represent an immortal population of genetic clones derived by repeated brother-sister mating (Lyon et al. 1996). Because mice from each strain are genetically identical, it is possible to collect and combine biological data over time and space, leading to a depth of phenotype characterization rarely achieved in other mammalian systems (Bogue 2003). Furthermore, the existence of a definite set of genetic differences among inbred strains allows scientists to explore the effect of genetic diversity on almost any phenotype of interest (Wade and Daly 2005). These studies require an accurate description of the level and distribution of genetic variation present among the hundreds of existing inbred strains. This is a challenging problem because the diversity between strains varies from extremely low levels found among sister substrains to very high levels found among strains derived from different species and subspecies (Petkov et al. 2004; Ideraabdullah et al. 2004; Yang et al. 2007).

Inbred strains can be classified into classical and wild-derived strains according to whether they were derived in the 20th century from a small set of founders known as “fancy” mice or derived from mice captured from natural populations more recently. Common wild-derived strains include representatives from two species, Mus spretus and M. musculus, thought to have diverged almost 2 million years ago (Guenet and Bonhomme 2003). There is well over 1% sequence divergence between these species, resulting in one SNP every 75 bp (Ideraabdullah et al. 2004). Within the M. musculus species there are four subspecies, M. m. domesticus, M. m. castaneus, M. m. musculus and M. m. molossinus, from which inbred strains have been derived. These subspecies are thought to have diverged 750,000 years ago (Guenet and Bonhomme 2003). There is roughly 1% divergence among subspecies, corresponding to one SNP every 150 bp (Ideraabdullah et al. 2004). Recent analysis of high density genotype data demonstrates that many wild-derived strain genomes carry regions of intersubspecific introgression (“contamination”) from a different subspecies (Yang et al. 2007). This analysis also confirms that classical strains are derived from multiple subspecies but that the contribution of M. m. domesticus represents over 90% of the genome in most of these strains. These data are critical to interpret the results of any mouse experiment in the proper evolutionary context and may have profound implications for our understanding of basic biological processes such as divergence, selection and speciation (Payseur and Hoekstra 2005; Mott 2007).

Since the genomic sequence of the C57BL/6 strain was reported (Waterston et al. 2002) much effort has been focused on the discovery and characterization of single nucleotide polymorphisms (SNPs) in inbred strains (Wade et al. 2002; Wiltshire et al. 2003; Yalcin et al. 2004; Pletcher et al. 2004; Frazer et al. 2004). Early SNP discovery projects carried out resequencing in a limited number of classical strains (Mural et al. 2002). More recently, the NIEHS used a hybridization based strategy to discover ~8.3 million SNPs in a survey of 15 inbred mouse strains, including four wild-derived strains representing the major subspecies of M. musculus (Frazer et al. 2007). In parallel to these SNP discovery efforts the Broad Institute of Harvard and MIT carried out genotyping of ~138,000 known SNPs on 49 inbred mouse strains (Wade and Daly 2005) and the Wellcome-CTC genotyped 499 inbred strains and outbred stocks at a lower SNP density with only ~13,370 SNPs (Shifman et al. 2006). Additional SNP resources are listed in Table 1.

Table 1
A summary of SNP resources. For each SNP resource, the name, the number of SNPs, the number of the strains reported, the genotyping technology used, and a literature citation are shown.

In the following, we refer to the NIEHS data as high density (>7 million genotyped SNPs). Medium density genotypes are similar in magnitude to the Broad set (>100,000 genotyped SNPs) and low density genotypes are similar in magnitude to the Wellcome-CTC set (>10,000 genotyped SNPs). Much of biomedical research involves inbred strains for which the description of the diversity is based on low to medium density SNP panels (Liao et al. 2004; Pletcher et al. 2004; Cervino et al. 2005; Shifman et al. 2006; McClurg et al. 2007; Payseur and Place 2007). Linkage disequilibrium (LD) among classical inbred strains is extensive (Wade et al. 2002; Petkov et al. 2005), suggesting that we could leverage the NIEHS data to impute genotypes at high density in a larger set of inbred mouse strains. Achieving this goal should immediately empower hundreds of laboratories to narrow quantitative trait loci, help design the next generation of experiments in mammalian genetics and provide invaluable support in the field of comparative and evolutionary genomics for the study of biological processes such as recombination, mutation and selection (DiPetrillo et al. 2005; Siebert and Schadt 2007; Roberts et al. 2007).

We propose a method to impute genotypes at high density in strains for which only medium or low density genotype data are available. We apply this method to create a resource of SNP genotypes at ~7.9 million loci across 49 inbred strains by combining existing public databases and imputing missing genotypes. The quality of the imputed genotypes is quantified and empirically validated. We find that the imputed genotypes are most reliable for classical strains that have at least medium density genotyping data available. The accuracy of imputed genotypes is somewhat lower in wild-derived strains. We provide a confidence score that can be used to identify those imputed genotypes that are most reliable.

MATERIALS AND METHODS

Data preparation

Prior to combining databases, a multi-step quality control procedure was applied to the original NIEHS (http://mouse.perlegen.com/mouse/download.html, July 2006 release) and Broad (http://www.broad.mit.edu/~claire/MouseHapMap, February 2006 release) SNPs. First, we eliminated SNPs whose reported physical locations are impossible. We compared the genotypes and the 100-mer flanking sequences of all C57BL/6 SNPs in the data to the published mouse genome sequence NCBI build 36. This process remapped the NCBI build 33 Broad data to NCBI build 36 coordinates and removed those SNPs that revealed any discrepancy. A total of 75 Broad and 2,925 NIEHS SNPs were excluded. We then identified SNPs that are present in duplicated regions. We have previously observed that these SNPs can have very high false positive rates due to the detection of paralogous variation at other sites (Yang et al. 2007; unpublished data). We used BLAT (Kent 2002) under the most sensitive parameter settings to map all 25-mers centered on each SNP and defined a duplication as a SNP for which the 25-mer maps to multiple genomic locations with fewer than three mismatches. We then used a sliding window to search for clustered duplications. Whenever two duplications were found out of four consecutive SNPs, the duplicated SNPs and any intervening SNPs were removed. A total of 448,999 SNPs (~5.4%) were removed. Of the remaining NIEHS SNPs, 15,068 were reported to have the same genomic location. We kept one copy of each when all genotypes were fully consistent; otherwise, they were removed. For Broad data, we removed 2,762 SNPs (~2%) that mapped to the duplicated locations in the NIEHS data. SNPs that mapped to identical genomic locations (redundant SNPs) were combined. During the process of combining databases, strand orientation adjustment was done whenever necessary. Conflicting genotypes were recoded as missing data. When more than half of the strains had discordant genotypes, the SNP locus was excluded.
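The sliding-window rule for clustered duplications can be illustrated with a minimal sketch. It assumes that SNPs are listed in genomic order with a boolean duplication flag per SNP, and that "intervening" means every SNP between the first and last flagged SNP within a qualifying window. The function name and the handling of window edges are illustrative assumptions, not the authors' implementation.

```python
def clustered_duplication_mask(is_dup, window=4, min_dups=2):
    """Mark SNPs for removal when a window of consecutive SNPs holds
    at least `min_dups` flagged duplications (hypothetical helper)."""
    n = len(is_dup)
    remove = [False] * n
    # Slide a window of `window` consecutive SNPs along the chromosome.
    for start in range(0, max(n - window + 1, 1)):
        flagged = [i for i in range(start, min(start + window, n)) if is_dup[i]]
        if len(flagged) >= min_dups:
            # Remove the duplicated SNPs and everything between them.
            for i in range(flagged[0], flagged[-1] + 1):
                remove[i] = True
    return remove


flags = [False, True, False, True, False, False, False, True, False, False]
mask = clustered_duplication_mask(flags)
kept = [i for i, r in enumerate(mask) if not r]
print(kept)
```

Under this reading, an isolated duplication (index 7 in the toy input) is retained because no window of four consecutive SNPs contains a second flagged SNP.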

Genotype Imputation

We use a hidden Markov model (HMM) with left to right architecture (Figure 1) to impute the missing genotypes. In this model, there are six hidden states (H = 6) representing different haplotypes at each SNP. State transitions proceed from one SNP (columns in Figure 1) to the next according to a Markov process. The haplotypes of a strain can be viewed as a path through the model visiting one state per SNP locus, from the first SNP to the last on a given chromosome. Given a trained model and the genotypes of a strain, the path decoding problem is solved by Viterbi’s algorithm (Viterbi 1967). In Figure 1, the Viterbi paths are shown as colored lines. Strains with identical paths through this region are grouped, but in general each strain will have its own unique path through the haplotype states. States have a probabilistic output, representing the observed genotype. Missing genotypes are imputed as the allele that is most likely to be emitted by the states along the Viterbi path. The most probable genotype for each state is indicated in Figure 1. For every genotype, imputed and experimental, the posterior probability under the trained HMM serves as a confidence score, which is computed as the product of the posterior probability of the inferred haplotype state and the genotype probability given the state.
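The decoding and imputation step can be sketched on a toy model. The sketch below makes several simplifying assumptions that are not taken from the paper: alleles are coded 0/1 with `None` for missing, a single transition rate `lam` is shared by all loci (the trained model has locus-specific parameters), and `emit[t][h]` is a hypothetical table giving the probability of allele 1 in haplotype state `h` at locus `t`. Confidence scores, which require posterior state probabilities, are not computed here.

```python
import math


def viterbi_impute(obs, emit, lam=0.01):
    """Return the Viterbi path and genotypes with missing values filled in.

    obs  : list of 0, 1 or None (missing), one entry per SNP locus
    emit : emit[t][h] = P(allele 1 | haplotype state h at locus t)
    lam  : probability of leaving the current haplotype state (assumed shared)
    """
    T, H = len(obs), len(emit[0])
    stay, move = math.log(1 - lam), math.log(lam / (H - 1))

    def log_emit(t, h):
        if obs[t] is None:
            return 0.0  # a missing genotype contributes no evidence
        p = emit[t][h] if obs[t] == 1 else 1 - emit[t][h]
        return math.log(max(p, 1e-12))

    # Initialise with a uniform distribution over haplotype states.
    score = [math.log(1.0 / H) + log_emit(0, h) for h in range(H)]
    back = []
    for t in range(1, T):
        new_score, ptr = [], []
        for h in range(H):
            best = max(range(H), key=lambda g: score[g] + (stay if g == h else move))
            new_score.append(score[best] + (stay if best == h else move) + log_emit(t, h))
            ptr.append(best)
        score = new_score
        back.append(ptr)

    # Trace back the most probable path from the last locus to the first.
    state = max(range(H), key=lambda h: score[h])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()

    # Fill each missing genotype with the most likely allele of its path state.
    imputed = [o if o is not None else int(emit[t][path[t]] >= 0.5)
               for t, o in enumerate(obs)]
    return path, imputed


# Two toy haplotypes over five loci: state 0 ~ 1,0,1,1,0 and state 1 ~ 0,1,0,0,1.
emit = [[0.99, 0.01], [0.01, 0.99], [0.99, 0.01], [0.99, 0.01], [0.01, 0.99]]
path, imputed = viterbi_impute([1, 0, None, 1, None], emit)
print(path, imputed)
```

In this toy case the observed genotypes match the first haplotype, so the path stays in state 0 and the two missing genotypes are filled with that haplotype's alleles.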

Figure 1
HMM architecture. Each SNP locus is modeled using six hidden states representing haplotypes. Each state is labeled by the nucleotide that is most likely to be observed in that haplotype. Colored lines represent the most probable haplotypes (Viterbi paths) ...

Training the HMM involves estimation of a large number of free parameters. Parameter estimation is accomplished using the Expectation-Maximization (EM) method as elaborated in Churchill (1989). The convergence criterion for the EM algorithm is a change of less than 10^-6 in the log-likelihood. Initial values for the EM algorithm are sampled at least 10 times and the training run that achieves the highest likelihood is chosen.

Despite the vast amount of data (millions of genotypes) in the training sets, this is a data poor problem. At any given SNP we have genotypes for only a small to moderate number of strains, distinct haplotypes may not be equally represented, and the information available in adjacent SNPs decays more or less rapidly depending on marker density and the extent of local linkage disequilibrium. The number of parameters in the HMM is large and it grows linearly with the number of SNPs; therefore the prior distributions can be influential and should be chosen carefully to obtain the best results.

Transition and emission probabilities are assumed to follow Dirichlet prior distributions. For state transitions, the prior density is biased towards the transitions between the same haplotype, with probability 1 - λ, and is equally distributed among the other (H-1) haplotypes. A prior that favors small values of λ will encourage the use of more information from adjacent SNPs. Emission probabilities are assumed to follow a uniform prior distribution for the two possible alleles. The Dirichlet pseudocount method is used to combine prior information with maximum likelihood estimates (Durbin et al. 1998).
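As a rough illustration of the pseudocount method for the transition parameters, the sketch below builds row-wise Dirichlet pseudocounts whose prior mean places 1 - λ on the same-haplotype transition and λ/(H-1) on each of the others, then combines them with expected transition counts. The `strength` parameter (total pseudocount mass per row) and both function names are assumptions introduced for illustration; the paper does not report the prior weight used.

```python
def transition_pseudocounts(H, lam, strength):
    """Dirichlet pseudocounts with prior mean 1 - lam on the diagonal
    and lam / (H - 1) off the diagonal; `strength` is a hypothetical weight."""
    return [[strength * ((1 - lam) if i == j else lam / (H - 1)) for j in range(H)]
            for i in range(H)]


def posterior_mean(counts, pseudo):
    """Combine expected counts with pseudocounts, one row at a time."""
    estimate = []
    for count_row, pseudo_row in zip(counts, pseudo):
        total = sum(count_row) + sum(pseudo_row)
        estimate.append([(c + a) / total for c, a in zip(count_row, pseudo_row)])
    return estimate


H, lam = 3, 0.01
pseudo = transition_pseudocounts(H, lam, strength=10.0)

# With no data, the estimate equals the prior mean.
no_data = posterior_mean([[0.0] * H for _ in range(H)], pseudo)

# With data, the counts pull the estimate away from the prior.
with_data = posterior_mean([[0.0, 5.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], pseudo)
print(no_data[0], with_data[0])
```

A larger `strength` keeps the estimate closer to the prior, which is one way to encode a preference for small λ and thus for borrowing information from adjacent SNPs.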

A series of computational experiments was carried out to optimize the predictive accuracy of the HMM. We varied both the number of haplotypes and the prior parameters and assessed the accuracy of imputation by randomly masking portions of the genotype data. When H is too small, accuracy declines but we saw little improvement for these data when H is greater than 5 or 6. In genomic regions with fewer than six distinct haplotypes, a subset of the states will typically have small marginal probabilities and are effectively unused. Based on these studies, we chose H=6 and a prior mean transition rate of 0.01 as optimal values for imputation in the merged NIEHS-Broad data (Figure 2).
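The random-masking assessment can be sketched as follows. The helpers below hide a fraction of the observed genotypes of one strain and score any imputation against the hidden values. The function names, the per-strain list representation and the seeded random generator are illustrative assumptions, and the imputation method itself is left abstract.

```python
import random


def mask_genotypes(genotypes, fraction, seed=0):
    """Hide a random fraction of the observed (non-None) genotypes."""
    rng = random.Random(seed)
    observed = [i for i, g in enumerate(genotypes) if g is not None]
    hidden = set(rng.sample(observed, int(round(fraction * len(observed)))))
    masked = [None if i in hidden else g for i, g in enumerate(genotypes)]
    return masked, sorted(hidden)


def masked_accuracy(truth, imputed, hidden):
    """Fraction of hidden genotypes that the imputation recovered."""
    if not hidden:
        return float("nan")
    return sum(truth[i] == imputed[i] for i in hidden) / len(hidden)


truth = [0, 1, None, 1, 0, 1, 1, 0, None, 0]
masked, hidden = mask_genotypes(truth, fraction=0.25)
# Any imputation method can be scored; a perfect one recovers every hidden value.
print(len(hidden), masked_accuracy(truth, truth, hidden))
```

Repeating this over a grid of candidate values for H and the prior transition rate, and keeping the setting with the highest masked accuracy, is one way such a tuning experiment could be organised.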

Figure 2
Relationships among SNP data sources. The sequence of C57BL/6J provides reference genotypes for all SNPs in this study. The NIEHS data on 15 strains and Broad data on 48 strains were combined to create a merged set of experimental SNPs. The HMM was trained ...

RESULTS

Combining large scale SNP panels

In order to create the imputed genotype resource, we first merged the SNP genotypes in the NIEHS and Broad data. We refer to these data as the merged set. The relationship of the merged set to other SNP sets used in this study is summarized in Figure 2.

The NIEHS data (http://mouse.perlegen.com/mouse/download.html, July 2006 release) include 109 million genotypes on ~8.3 million SNPs spanning the 19 autosomes, the X and Y chromosomes, and the mitochondrial genome, for 11 classical and four wild-derived strains. The genotypes were generated by Perlegen Sciences using high density oligonucleotide arrays. Approximately 54% of the NIEHS SNPs have missing genotypes for one or more of the 15 strains, summing to 12% incomplete genotypes. The frequency of missing data is higher in the three wild-derived strains of non-domesticus origin (CAST, MOLF and PWD). We removed 5.5% of SNPs from the initial NIEHS set due to potential problems (see Materials and Methods and Table S1). The 7,804,762 remaining SNPs spanning the autosomes and the X chromosome were used in this study.

The Broad data (http://www.broad.mit.edu/~claire/MouseHapMap, February 2006 release) include over 6 million genotypes for 138,793 SNPs distributed at ~20kb intervals across the autosomes and the X chromosome, for 38 classical and 11 wild-derived strains representing four subspecies of M. musculus and two representatives of M. spretus. Genotypes were generated using custom Affymetrix SNP array technology. Approximately 68% of the SNPs have missing genotypes for one or more strains summing to 8% incomplete genotypes. The frequency of incomplete genotypes is higher for wild-derived strains with the exception of the two wild-derived M. m. domesticus strains, WSB and PERA. After mapping to NCBI build 36 coordinates and data preparation (see Materials and Methods), 135,846 SNPs were retained for this study.

The 15 NIEHS strains are common to both datasets and the remaining 33 strains are unique to the Broad data. Genotypes from the reference C57BL/6 genome sequence are included in the merged data. In total there are 116 million genotypes for 7,870,134 SNPs spanning the autosomes and the X chromosome of 49 strains.

In the process of merging datasets of dramatically different marker densities we assigned all of the unavailable genotypes to be missing. Thus there are two types of missing genotypes in the merged set. Experimental missing data are due to failure of a genotyping assay. Missing data created as a result of merging the data have, for the most part, not been directly assayed. In the merged set, 7,734,384 of the loci are assayed only in NIEHS data, 65,468 only in the Broad data and 70,282 loci have been assayed in both. There are a total of 14,078,485 experimental missing genotypes, 255,234,672 missing genotypes created by merging these data, and 6,318 missing values due to conflicts (see Methods). The frequency of incomplete genotypes is 98.3% for the 33 strains unique to the Broad set and ranges from 10% to 16% for the 15 strains common to both Broad and NIEHS data.
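The counts reported for the merged set can be cross-checked with simple arithmetic. The short script below uses only numbers quoted in the text; the variable names are ours.

```python
# Loci by source, as reported for the merged set.
loci_niehs_only, loci_broad_only, loci_both = 7_734_384, 65_468, 70_282
total_loci = loci_niehs_only + loci_broad_only + loci_both

# Missing genotypes by origin.
missing_experimental, missing_merge, missing_conflict = 14_078_485, 255_234_672, 6_318
total_missing = missing_experimental + missing_merge + missing_conflict

# Every locus has one genotype slot per strain in the merged set.
n_strains = 49
total_slots = total_loci * n_strains
observed = total_slots - total_missing

print(total_loci)     # 7870134, the number of SNPs in the merged set
print(total_missing)  # 269319475, the number of genotypes later imputed
print(observed)       # 116317091, i.e. about 116 million experimental genotypes
print(round(total_missing / total_slots, 3))
```

The three totals agree with the 7,870,134 SNPs, the roughly 116 million genotypes and the 269,319,475 imputed genotypes reported elsewhere in the Results.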

The imputed genotypes

We implemented a hidden Markov model (HMM) with a left to right architecture (Figure 1) for the primary purpose of genotype imputation and for the secondary purpose of haplotype identification. The architecture is similar to those described in Kimmel and Shamir (2005) and in Scheet and Stephens (2006). The number of hidden states at each SNP locus and the prior distribution of the model parameters of the HMM were optimized for genome-wide imputation accuracy (see Materials and Methods). All 49 strains in the merged set were used to train the model (see Materials and Methods). The posterior probability of each imputed genotype under the trained model provides a confidence score. A total of 269,319,475 missing genotypes were imputed in the merged set (merged-imputed set in Figure 2), of which 14.9%, 15.0% and 70.1% fall into the low, medium and high confidence score bins of (0,0.6), (0.6,0.9) and (0.9,1), respectively (Table S2).

In order to assess the quality of the imputed genotypes, we assembled a validation set of genotypes reported in eight SNP resources developed independently from the NIEHS and Broad sets (Figure 2, Table 1, Table S3). A total of 969,457 imputed genotypes could be validated using these resources, of which 15.4%, 13.7% and 70.8% fall into the low, medium and high confidence score bins, respectively. The number of validated genotypes varies substantially among inbred strains (Table 2) and seven strains (129S4/SvJae, DDK/Pas, MAI/Pas, O20, Qsi5, ST/bJ and SEG/Pas) have no genotypes available for validation.

Table 2
Estimated strain specific error rates in the merged-imputed set. The table provides (1) the total number of imputed missing genotypes, the percentage of imputed genotypes that were missing due to merging datasets; (2) the total number of validated imputed ...

We compared imputed genotypes to genotypes in the validation set and conservatively assume that discordant genotypes represent imputation errors. The overall imputation error rate based on comparison with the validation set is 0.104 (Table 2). Error rates vary substantially among strains and to a lesser degree across chromosomes (Table S2). Strain specific error rates are lower for classical strains than for wild-derived strains. Furthermore, error rates vary for the two different types of missing data (Table 2). For imputed genotypes with high confidence scores, the error rate is 0.044 (Table 2). Among validated genotypes with high confidence scores wild-derived strains have higher error rates than classical strains. The NIEHS strains have higher error rates because most of the missing genotypes are experimental.
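The stratification of validation error rates by confidence score can be sketched as below. The records are invented toy values, not data from the study, and the assignment of scores that fall exactly on a bin boundary (0.6 or 0.9) is an assumption, since the text gives the bins only as open intervals.

```python
from collections import defaultdict


def confidence_bin(score):
    """Assign a confidence score to the low, medium or high bin
    (boundary handling is an assumption)."""
    if score < 0.6:
        return "low"
    if score < 0.9:
        return "medium"
    return "high"


def error_rates_by_bin(records):
    """records: iterable of (imputed_allele, validated_allele, confidence_score).
    A discordant pair is counted as an imputation error."""
    totals, errors = defaultdict(int), defaultdict(int)
    for imputed, validated, score in records:
        label = confidence_bin(score)
        totals[label] += 1
        errors[label] += int(imputed != validated)
    return {label: errors[label] / totals[label] for label in totals}


toy_records = [
    ("A", "A", 0.95), ("G", "A", 0.95), ("C", "C", 0.99), ("T", "T", 0.97),
    ("A", "G", 0.50), ("C", "C", 0.55), ("G", "G", 0.70),
]
rates = error_rates_by_bin(toy_records)
print(rates)
```

Counting every discordance as an imputation error mirrors the conservative assumption made in the text, since some discordances may instead reflect genotyping errors in the validation data.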

As a consequence of high levels of divergence between the classical and wild-derived strains, 59% of SNPs in the merged set are private to the wild-derived strains (i.e., the genotypes of those SNPs are constant within classical strains). We estimated error rates stratified by the status of a SNP being constant or polymorphic within the classical strains and found that they were essentially identical (Table S4). We note that, for strains A, DBA/2 and 129S1, the error rates within the constant SNPs (26% of the total validated SNPs) were elevated compared to the unstratified version (Table 2). These strains have a large number of validation genotypes available (Mural et al. 2002). This interesting pattern and the higher experimental error rates in the NIEHS strains suggest that gene conversion may be responsible for a large fraction of the imputation error.

To test how wild-derived strains impact the imputation accuracy among the classical strains, we retrained the HMM on a subset of the merged data including only the 38 classical strains and estimated the imputation error rates by comparison with the validation data. The impact on error rates varies across chromosomes and is most evident in chromosomes where there is a substantial contribution to the classical strains from M. m. musculus (Yang et al. 2007). The higher errors observed overall suggest that the inclusion of the wild-derived strains improves the imputation of missing genotypes in regions where some classical inbred strains are not of M. m. domesticus origin, without negatively impacting the rest.

Based on this empirical validation, we conclude that the quality of the imputed genotypes improves as information from more strains is utilized in training the HMM, and that the imputation of missing genotypes is highly reliable for most strains.

Imputation of genotypes in other mouse strains

The trained HMM can be used to infer high density genotypes for strains that are not included in the training set using the Viterbi algorithm (Viterbi 1967). Strains NZO/HILtJ and PWK/PhJ are not in the merged set but are of interest because they are founder strains of the Collaborative Cross (Churchill et al. 2004). NZO is a classical inbred strain closely related to strain NZB of the merged set. PWK is a wild-derived M. musculus strain that is most closely related to strain PWD in the merged set, although the overall sequence divergence between PWK and PWD is greater than between any pair of classical strains. We have assembled 140,269 genotypes and 132,862 genotypes for NZO and PWK, respectively, from four independent resources (Table S3; Tim Wiltshire, personal communication) and created a medium density SNP set by retaining only genotypes that correspond to Broad loci in the merged set. The remaining SNPs were used for validation (Table 3). Similarly, we created low density versions of NZO and PWK SNPs by selecting SNP loci that correspond to loci genotyped in the Wellcome-CTC study (Shifman et al. 2006). We ran the Viterbi algorithm and imputed missing genotypes for both strains at medium and low density. At medium density, NZO imputations yield a large proportion of high confidence SNPs, but at low density, confidence scores shift substantially downward. At medium density about one third of the imputed PWK SNPs have low confidence and at low density, nearly 80% of imputed SNPs fall into the lowest confidence category. The low confidence in the PWK imputations is likely due to the presence of M. m. musculus haplotypes in PWK that are not represented in the NIEHS strains. The validation accuracy of the imputed SNPs (Table 3) follows the same pattern as the confidence scores. Although the overall validation error rate of 38% for low density PWK imputations is unacceptably high, for those SNPs that achieve the highest confidence scores (>0.9), chromosome specific error rates range from 1% to 8%. These results suggest that additional strains with medium density SNP genotyping can be accurately imputed and that the best overall results will be achieved for classical strains.

Table 3
Estimated error rates for imputed genotypes using SNP genotyping data at different densities. The table provides (1) the total number of genotyped SNPs; the proportion of the merged data; (2) the number of imputed genotypes, (3) the number of validated ...

The imputed genotype resource

To generate the most accurate imputation of genotypes on the 49 inbred mouse strains, we added the 969,457 SNP genotypes in the validation set to the merged set and created a combined set of SNP genotypes (Figure 2). We then retrained the HMM using 49 strains in the combined set and imputed the missing genotypes. The confidence score distribution of imputed genotypes from the combined set demonstrated a slight shift towards the higher confidence score bins, indicating increased accuracy, with 14.1%, 14.7% and 71.2% of all imputed genotypes falling into the low, medium and high confidence score bins, respectively. In addition, we recomputed the imputation for two strains, NZO and PWK, using all of the available genotypes. These strains were not included in the training set. The complete set of imputed genotypes and confidence scores for 51 strains is available at http://cgd.jax.org/. The data can be downloaded or queried through a MySQL database. The SNPs have been quality-checked and cross-linked with other features, including ENSEMBL annotations, GO annotations, and MGI gene and phenotype information. We created a web interface that allows SNP retrieval filtered on features such as neighboring genes, genomic location, SNP functional implication, CpG sites, and substitution types.

DISCUSSION

We have created a data resource of experimental and imputed SNP genotypes at a density of 7,870,134 loci on 49 commonly used inbred mouse strains by combining data and imputing missing genotypes from two major public SNP collections. Our results support the hypothesis that strains with medium density genotyping can be accurately imputed to obtain high density genotypes. Confidence scores assigned to each genotype reflect the reliability of imputed genotypes and identify experimental genotypes that depart from expectations based on local LD. We have demonstrated the accuracy of imputed genotypes by comparison to experimental genotypes obtained independently.

Imputed genotypes are not a replacement for experimental measurements. We encourage investigators to use this resource as an exploratory tool, but critical conclusions based on imputed genotypes should be validated. Low confidence imputations are more common in the wild-derived strains due to the limited representation of appropriate taxa in the NIEHS strain panel. Nonetheless, the reliability of most genotypes in this resource is sufficient for high throughput analysis and hypothesis generation.

Extensive local LD, reflecting the small number of founders and the presence of admixture in laboratory mice, provides the essential structure that allows accurate imputation of missing genotypes. To achieve this, the density of SNP genotyping should be sufficient to tag most regions of local LD in the population of strains to which the imputation algorithm is being applied. Imputation may be unreliable if a novel haplotype, not present in the high density training data, is encountered. Furthermore, SNPs that were not identified in the discovery process are not represented in the imputed data resource. We have previously estimated the false negative rate in the NIEHS data to be 67% (the false negative rate in the classical strains is 43%) but the rate is significantly higher for singleton SNPs (Yang et al. 2007). Therefore, absence of a SNP in this, or any, resource does not imply genetic identity. False negatives are of concern when the missing SNPs are private to one strain or to a small group of related strains. Discovery bias will significantly impact our ability to accurately impute SNPs in inbred strains derived from diverse mouse lineages. The solution is to carry out more SNP discovery at high density. The report of the sequence of multiple Drosophila species highlights the importance and benefits of resequencing and SNP discovery in diverse taxa (Drosophila 12 Genomes Consortium 2007). Similarly, it will be particularly useful to identify lineage specific SNPs and to genotype additional representatives of lineages with high error rates such as M. m. castaneus and M. spretus.

During the data preparation phase of this project we intentionally removed SNPs from regions with evidence of multiple copies in the C57BL/6 genome because of the higher error rates observed in these SNPs (unpublished). Variable repeat regions are also present in other strains. Repeats and copy number variation need to be included to achieve a complete understanding of the landscape of genetic variation in the laboratory mouse.

Other important features of genetic variation, such as gene conversion and recurrent mutations, may be missed because they are not consistent with the local LD pattern. In fact, the patterns of errors observed in the Celera strains suggest that both processes contribute significantly to the imputation error. This situation can be solved through additional experimental determination of missing genotypes. Finally, genotyping errors in the training data present a major barrier to improving the accuracy of imputed genotypes. Identification and resolution of genotyping errors in data as extensive as these is a daunting task. Confidence scores obtained from the HMM can suggest potential genotyping errors, but not all genotyping errors can be detected in this way. New computational approaches to error detection would be beneficial.

We used an empirical approach to validate imputed genotypes. Our estimated error rates are likely to be conservative because of genotyping errors in the validation set. An alternative method to assess imputation accuracy (Roberts et al. 2007) is to randomly mask a proportion of the genotypes within the combined dataset and to compare the imputed values to the masked genotypes. We found that the random masking approach provides higher estimated accuracy compared to the empirical validation. The processes that lead to missing data are likely to be more complex than the masking model, thus we prefer the empirical estimates but acknowledge their conservative bias.

The imputed genotype resource must be viewed dynamically because the coverage of strains, the number of SNP loci, and the accuracy of imputation will improve with additional data from ongoing genotyping projects. In conclusion, this study reports a method for accurate imputation of missing genotypes and its use in generating a dense map of the genetic variation in the mouse genome. Our results support the proposal by Frazer and coworkers (2007) that such a resource could and must be generated.

Acknowledgments

This work was supported by the US National Institute of General Medical Sciences as part of the Center of Excellence in Systems Biology (1P50 GM076468). We thank Tim Wiltshire for sharing genotyping data prior to its publication, and Jesse Hammer and Susan Moxley for graphics assistance.

References

  • Abe K, Noguchi H, Tagawa K, Yuzuriha M, Toyoda A, et al. Contribution of Asian mouse subspecies Mus musculus molossinus to genomic constitution of strain C57BL/6J, as defined by BAC-end sequence-SNP analysis. Genome Res. 2004;14:2439–2447.
  • Bogue MA. Mouse Phenome Project: understanding human biology through mouse genetics and genomics. J Appl Physiol. 2003;95:1335–1337.
  • Cervino AC, Li G, Edwards S, Zhu J, Laurie C, et al. Integrating QTL and high-density SNP analyses in mice to identify Insig2 as a susceptibility gene for plasma cholesterol levels. Genomics. 2005;86:505–517.
  • Churchill GA, Airey DC, Allayee H, Angel JM, Attie AD, et al. The Collaborative Cross, a community resource for the genetic analysis of complex traits. Nat Genet. 2004;36:1133–1137.
  • Churchill GA. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology. 1989;51:79–94.
  • DiPetrillo K, Wang X, Stylianou IM, Paigen B. Bioinformatics toolbox for narrowing rodent quantitative trait loci. Trends Genet. 2005;21:684–692.
  • Durbin R, Eddy SR, Krogh A, Mitchison G. Biological sequence analysis. Cambridge University Press; Cambridge, UK: 1998.
  • Drosophila 12 genomes consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007;450:203–218.
  • Frazer KA, Wade CM, Hinds DA, Patil N, Cox DR, et al. Segmental phylogenetic relationships of inbred mouse strains revealed by fine-scale analysis of sequence variation across 4.6 Mb of mouse genome. Genome Res. 2004;14:1493–1500.
  • Frazer KA, Eskin E, Kang HM, Bogue MA, Hinds DA, et al. A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature. 2007;448:1050–1053.
  • Guenet JL, Bonhomme F. Wild mice: an ever-increasing contribution to a popular mammalian model. Trends Genet. 2003;19:24–31.
  • Ideraabdullah FY, de la Casa-Esperon E, Bell TA, Detwiler DA, Magnuson T, et al. Genetic and haplotype diversity among wild derived mouse inbred strains. Genome Res. 2004;14:1880–1887.
  • Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002;12:656–664.
  • Kimmel G, Shamir R. A block-free hidden Markov model for genotypes and its application to disease association. J Comput Biol. 2005;12:1243.
  • Liao G, Wang J, Guo J, Allard J, Cheng J, et al. In silico genetics: identification of a functional element regulating H2-Ealpha gene expression. Science. 2004;306:690–695.
  • Lyon MF, Rastan S, Brown SDM, editors. Genetic variants and strains of the laboratory mouse. 3. Oxford University Press; Oxford, UK: 1996.
  • McClurg P, Janes J, Wu C, Delano DL, Walker JR, et al. Genomewide association analysis in diverse inbred mice: power and population structure. Genetics. 2007;176:675–683.
  • Mott R. A haplotype map for the laboratory mouse. Nat Genet. 2007;39:1054–1056.
  • Mural RJ, Adams MD, Myers EW, Smith HO, Miklos GL, et al. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science. 2002;296:1661–1671.
  • Payseur BA, Hoekstra HE. Signatures of reproductive isolation in patterns of single nucleotide diversity across inbred strains of mice. Genetics. 2005;171:1905–1916.
  • Payseur BA, Place M. Prospects for association mapping in classical inbred mouse strains. Genetics. 2007;175:1999–2008.
  • Pletcher MT, McClurg P, Batalov S, Su AI, Barnes SW, et al. Use of a dense single nucleotide polymorphism map for in silico mapping in the mouse. PLoS Biol. 2004;2:2159–2169.
  • Petkov PM, Graber JH, Churchill GA, DiPetrillo K, King BL, et al. Evidence of a large-scale functional organization of mammalian chromosomes. PLoS Genet. 2005;1:e33.
  • Petkov PM, Ding Y, Cassell MA, Zhang W, Wagner G, et al. An efficient SNP system for mouse genome scanning and elucidating strain relationships. Genome Res. 2004;14:1806–1811.
  • Roberts A, McMillan L, Wang W, Parker J, Rusyn I, et al. Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows. Bioinformatics. 2007;23:i401.
  • Roberts A, Pardo-Manuel de Villena F, Wang W, McMillan L, Threadgill DW. The polymorphism architecture of mouse genetic resources elucidated using genome-wide resequencing data: implications for QTL discovery and systems genetics. Mamm Genome. 2007;18:473–481.
  • Sieberts SK, Schadt EE. Moving toward a system genetics view of disease. Mamm Genome. 2007;18:389–401.
  • Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78:629.
  • Shifman S, Bell JT, Copley RR, Taylor MS, Williams RW, et al. A high-resolution single nucleotide polymorphism genetic map of the mouse genome. PLoS Biol. 2006;4:e395.
  • Viterbi AJ. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Information Theory. 1967;13:260–269.
  • Wade CM, Kulbokas EJ, 3rd, Kirby AW, Zody MC, Mullikin JC, et al. The mosaic structure of variation in the laboratory mouse genome. Nature. 2002;420:574–578.
  • Wade CM, Daly MJ. Genetic variation in laboratory mice. Nat Genet. 2005;37:1175–1180.
  • Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562.
  • Wiltshire T, Pletcher MT, Batalov S, Barnes SW, Tarantino LM, et al. Genome-wide single-nucleotide polymorphism analysis defines haplotype patterns in mouse. Proc Natl Acad Sci USA. 2003;100:3380–3385.
  • Yalcin B, Fullerton J, Miller S, Keays DA, Brady S, et al. Unexpected complexity in the haplotypes of commonly used inbred strains of laboratory mice. Proc Natl Acad Sci USA. 2004;101:9734–9739.
  • Yang H, Bell TA, Churchill GA, Pardo-Manuel de Villena F. On the subspecific origin of the laboratory mouse. Nat Genet. 2007;39:1100–1107.