Mouse Segmental Duplication and Copy-Number Variation

1Department of Genome Sciences, University of Washington, Seattle, WA 98195
2Department of Biostatistics and Department of Psychiatry, University of Michigan, Ann Arbor, MI 48109
3National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Rockville Pike, Bethesda, MD 20894
4Howard Hughes Medical Institute, Seattle, WA 98195

†Corresponding author: Evan Eichler, Ph.D., Howard Hughes Medical Institute and University of Washington School of Medicine, Department of Genome Sciences, Box 355065, Foege S413C, 1705 NE Pacific St., Seattle, WA 98195. E-mail: eee/at/gs.washington.edu

Abstract

Detailed analyses of the clone-based genome assembly reveal that the recent duplication content of mouse (4.94%) is now comparable to that of human (5.5%), in contrast to previous estimates from the whole-genome shotgun sequence assembly. The architecture of the mouse and human genomes differs dramatically; most mouse duplications are organized into discrete clusters of tandem duplications that are depleted for genes/transcripts and enriched for LINE and LTR retroposons. We assessed copy-number variation of the C57BL/6J duplicated regions within 15 mouse strains used for genetic association studies, sequencing, and the Mouse Phenome Project. We determined that over 60% of these basepairs are polymorphic between the strains (on average 20 Mbp of copy-number variable DNA between different mouse strains). Our data suggest that different mouse strains show comparable, if not greater, copy-number polymorphism when compared to human; however, such variation is more locally restricted. We show large and complex patterns of inter-strain copy-number variation restricted to large gene families associated with spermatogenesis, pregnancy, viviparity, pheromone signalling, and immune response.

Initial estimates suggested that 1-2% of the mouse genome1-3 consisted of high-identity (>90%) duplications.
These estimates, however, were complicated by the whole-genome shotgun sequence assembly (WGSA) method, which cannot resolve large, highly identical duplications. In particular, the largest (>15 kb) and most identical (>97%) duplicated segments4 are often missing, collapsed, or mis-assigned as part of WGSA draft assemblies. Missing duplications, for example, are thought to result from difficulties in assembling regions of the genome where there is an excess of sequence mate-pair violations due to paralogous sequences. As the mouse genome assembly has progressed from WGSA to an ordered BAC-based assembly, the segmental duplication (SD) content has gradually increased5,6. Accurate resolution of the duplicated regions is particularly critical as some of these regions have been shown to be highly variable in copy-number between commonly related strains of mice7-11, enriched in lineage-specific gene families undergoing positive selection12,13, and preferential sites of large-scale rearrangement associated with chromosome evolution in the rodent lineage6,14-16.

Here we present a detailed analysis of the recent duplication content of the mouse clone-based finished genome assembly and assess copy-number variation (CNV) of these regions in 15 different inbred strains of mice. The results suggest distinct properties of mouse SDs when compared to human and reveal previously unrecognized complex patterns of structural variation.

RESULTS

A self-comparison of the current mouse genome assembly (Build36) identifies 141.4 Mbp of SD (>1 kbp in length and >90% identity) (see Supplementary Note for details). We confirmed 96% (83.14/86.63 Mbp) of the largest (>10 kbp) and most identical (>94%) duplications using a previously described detection strategy that is independent of the assembly2,17. As a second measure of validation, we examined a total of 24 large-insert clones that had been shown to produce multi-site signals by FISH on C57BL/6J metaphase chromosomes2,8.
Of the corresponding sequences, 23/24 were confirmed as duplicated by at least one of our measures for duplication (Supplementary Table 1). Using only the assembly-based comparison, we found that the majority (21/24) carried more than 40% duplicated basepairs attesting to the high quality of the mouse assembly (Supplementary Table 1). In total, if we consider all pairwise alignments (<94% identity) and all those (>94% sequence identity) that are confirmed by two independent methods, we calculate the SD content of the mouse genome to be 4.94%. This value represents a 2- to 3-fold increase from previous estimates1-3. The availability of the human and mouse genomes as clone-ordered BAC-based sequence assemblies provides the first opportunity to systematically compare SD sequence properties for two mammalian genomes (Table 1). Both genomes show similar levels of duplication (~5%) distributed in a highly non-random fashion (Supplementary Fig. 1). We find that recent mouse duplications are restricted to fewer genomic locations, with a total of 149 mouse duplication blocks (Table 1, Fig. 1
As noted previously2,5,8, there are few examples of large interchromosomal duplication (Table 1) and most large (>10 kb) intrachromosomal duplications are tandemly organized with >89% of the pairwise alignments mapping in close proximity to one another (Fig. 2
The enrichment of Alu-SINE repeat elements at the boundaries of new human segmental duplications has been taken as evidence that these elements played a role in the dispersal of SDs in the ancestral primate genome19,20. We examined the repeat composition of mouse segmental duplications and found them significantly enriched for both LINE and LTR repeat elements (1.5- to 2-fold enrichment) (Supplementary Table 3, Fig. 3
Numerous studies in different organisms have shown that segmental duplications are enriched 4- to 10-fold for copy-number variation9,21-23 although such variation also occurs outside regions of SD. Using our duplication map of the mouse genome, we specifically focused on the design of a customized high-density oligonucleotide array (average 1 probe/481 bp) targeted to C57BL/6J SDs that were confirmed by both computational methods (Supplementary Note). As a control, we also selected 273 regions that had been predicted to be copy-number variant based on earlier BAC-arrayCGH experiments (Supplementary Table 5). We selected 15 inbred strains of mice based on their genealogical relationship to C57BL/6J or use as NIEHS sequencing strains/Mouse Phenome Project. All arrayCGH experiments were performed using C57BL/6J as the reference strain. Based on the raw log2 signal intensity data24, striking CNV was observed between the C57BL/6J and the other inbred strains (Fig. 4a
When comparing all 15 strains against the C57BL/6J reference, we identify in total 2,424 CNV sites (1,259 gains and 1,958 losses). Of these CNV events, 56% in each strain are predicted as high-confidence intervals (p>0.8), and 85-92% of those are novel when compared to previous reports (Supplementary Table 6, Supplementary Table 7, Table 2). Most of the variation in segmental duplications was not detected previously because probes were under-represented 10-fold when compared to unique regions and 50-fold when compared to our C57BL/6J duplication-specific microarray (Supplementary Table 8)10. Even among the confirmed sites of CNV, we observe significantly more substructure than previously reported, revealing a complex pattern of copy-number gain and loss associated with mouse segmental duplications (Fig. 4b
We identified 353 genes embedded within SDs that showed either gain or loss (Supplementary Table 6). Of these, 194 CNV intervals are sufficiently large to affect the entire gene, including 31 genes showing both gains and losses in different strains with respect to C57BL/6J (Supplementary Table 6). Several of the copy-number variant genes are associated with spermatogenesis, pregnancy, and viviparity (e.g. Spetex, Xmr, Tcte, Ott, prolactin/proliferin, Il11ralpha)25,26. Other gene families associated with pheromone response show large-scale CNV between the strains (e.g. vomeronasal receptor (V2r and V1r)27 and major urinary proteins (Mup) gene families28). Similar to the human genome, immune response genes show extensive copy-number polymorphism. For example, the defensin genes (Defcr21, 22, 23, Def5b1), neuronal apoptosis inhibitory protein (Naip) gene family and killer cell lectin-like receptor family a (Klra) are all part of CNV duplication blocks associated with strain variability to infection29-31.

DISCUSSION

Although similar in proportion (~5%), recent mouse genomic duplications, in contrast to those of humans, are organized into discrete clusters of tandem duplications that are depleted for genes/transcripts and enriched for LINE and LTR retroposons. We hypothesize that the strong association with younger LINE elements, as opposed to primate Alu SINE elements, might explain some of the key differences between human and mouse duplications. For example, LINE repeat sequences preferentially map to AT-rich, gene-poor regions due to the sequence preference of the RT-endonuclease32. A similar bias against genes has been observed for LTR elements33. If LINE/LTR sequences promote segmental duplication, it may explain why there is a deficiency of genes/transcripts in mouse duplications, while in humans the trend is in the opposite direction (i.e. segmental duplications associate with SINE-rich, gene-rich regions of the genome)32.
In addition, we find that mouse duplicated sequences have 3 to 4 times as many paralogs when compared to human.

We conservatively estimate that at least 20 Mb of segmental duplication is copy-number variable between strains (Table 2). When compared to recent surveys of copy-number variation in humans34,35, we find that different strains of mice show as much, if not more, copy-number variable DNA within the duplicated regions. We propose that the larger number of local pairwise alignments in tandem orientation within the mouse increases the potential for non-allelic homologous recombination and, thus, the mutation frequency. In this regard, it is interesting that of the 15 CNVs that intersect with those reported by Egan and colleagues, 14/15 were shown to occur recurrently within mouse strains10. Those with the highest frequency of new mutation (~1 spontaneous mutation per 100 newborns) are composed almost entirely (77-92%) of segmental duplications (Supplementary Table 7). Further studies of the normal pattern of copy-number variation within wild outbred lines of mice and sequencing of additional murid genomes will be necessary to assess the generality of these findings.

METHODS

DNA Samples

All spleen-derived DNA samples were obtained from male individuals representing 15 inbred strains of mice (Jackson Laboratory). These included: C57BL/6J, DBA/2J, A/J, C57BL/10J, CZECHI/EiJ, CAST/EiJ, BPH/2J, BALB/cByJ, C57BLKS/J, 129S1/SvImJ, DDY/JclSidSeyFrkJ, C57BR/cdJ, C57BL/6ByJ, NZO/HiLtJ, and NOD/LtJ. The reference sample in all these experiments was C57BL/6J (Prep#37347, a G227 male individual born Oct. 4, 2005). As a control, an arrayCGH experiment was performed against a second C57BL/6J individual (Prep#37579, a G230 male individual born Sept. 27, 2006). Inbred strains were selected in an effort to sample genetic diversity36 and to include strains from the Mouse Phenome Project and NIEHS sequencing projects.
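Because every hybridization uses C57BL/6J as the reference, each probe can be summarized as a log2 ratio of test to reference intensity, and the C57BL/6J versus C57BL/6J control gives a sense of how much ratios scatter when no copy-number difference is expected. The following Python sketch illustrates that idea only. The function names, the example intensities, and the rule of three standard deviations are assumptions made for illustration and are not taken from the study.

```python
import math
from statistics import pstdev


def log2_ratios(test, reference):
    """Per-probe log2(test / reference) for two equal-length intensity lists."""
    if len(test) != len(reference):
        raise ValueError("test and reference must have the same number of probes")
    if any(v <= 0 for v in test) or any(v <= 0 for v in reference):
        raise ValueError("intensities must be positive")
    return [math.log2(t / r) for t, r in zip(test, reference)]


def control_threshold(control_ratios, k=3.0):
    """Hypothetical noise cutoff: k standard deviations of the self-vs-self ratios."""
    return k * pstdev(control_ratios)


if __name__ == "__main__":
    # Made-up intensities: probe 2 doubles (gain), probe 3 halves (loss).
    ratios = log2_ratios([200, 400, 100, 200], [200, 200, 200, 200])
    print(ratios)  # [0.0, 1.0, -1.0, 0.0]
```

A ratio near zero means the test strain matches the reference at that probe. Positive and negative values point toward gain and loss, respectively, relative to C57BL/6J.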
Segmental Duplication Characterization

Two independent approaches were used to detect segmental duplications: WGAC (whole-genome assembly comparison) is a BLAST-based analysis of all assembled sequence that detects self-alignments (>90% identity and >1 kb); WSSD (whole-genome shotgun sequence detection) is an assembly-independent approach that examines the reference sequence for an increase in WGS read depth-of-coverage (WSSD-DOC) and/or an increase in the divergence read ratio (WSSD-DRR). We mapped 40,782,208 sequence reads against the Build36 genome assembly as part of the mouse WSSD analysis. We estimated the duplication content of the mouse genome based on the sum of low-identity WGAC (<94%) and high-identity WGAC (>10 kb, >94%) that were confirmed by the union of WSSD-DOC and WSSD-DRR estimates. Repeat content and subfamily designation were determined using RepeatMasker. Significance was determined by permutation (randomly sampling the genome and computing an enrichment greater than or equal to that observed within regions classified as segmentally duplicated). All underlying segmental duplication analysis data are available from http://mouseparalogy.gs.washington.edu and have been placed as customized tracks on the UCSC browser and the NCBI MapViewer for Build36.

Array Comparative Genomic Hybridization and CNV Detection

We designed a customized oligonucleotide microarray platform for array comparative genomic hybridization (NimbleGen). We targeted 385,000 probes to 159.4 Mb of the mouse genome assembly (Build36) where segmental duplications and/or CNVs were previously identified, as indicated in Table 2. Probe design and sample hybridization were performed at NimbleGen (Madison, WI, USA) using a standard tiling-array protocol. We identified copy-number variant regions between mouse strains using a novel HMM (see Supplementary Note for detailed description and software availability).
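The HMM used for CNV detection is described only in the Supplementary Note, so the code below is not the authors' model. It is a generic three-state Viterbi sketch in Python that shows how a hidden Markov model can segment a series of per-probe log2 ratios into loss, normal, and gain. The state means, the shared standard deviation, the probability of staying in a state, and all function names are assumptions chosen for illustration.

```python
import math

STATES = ("loss", "normal", "gain")


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi_cnv(log2_ratios, means=(-0.5, 0.0, 0.5), sd=0.2, stay=0.99):
    """Most likely state per probe under an assumed 3-state Gaussian HMM."""
    if not log2_ratios:
        return []
    n = len(STATES)
    switch = (1.0 - stay) / (n - 1)
    # Log transition matrix: high probability of remaining in the same state.
    log_trans = [[math.log(stay if i == j else switch) for j in range(n)]
                 for i in range(n)]
    # Uniform initial distribution over the three states.
    score = [math.log(1.0 / n) + gaussian_logpdf(log2_ratios[0], means[s], sd)
             for s in range(n)]
    backpointers = []
    for x in log2_ratios[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best predecessor state for ending in state j at this probe.
            best_i = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + gaussian_logpdf(x, means[j], sd))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


if __name__ == "__main__":
    ratios = [0.0] * 5 + [0.6] * 5 + [0.0] * 5 + [-0.55] * 5
    print(viterbi_cnv(ratios))
```

With these assumed parameters, a run of several shifted probes is called as a gain or loss, whereas a single outlying probe is absorbed into the surrounding state because two state switches cost more than one poorly fitting observation.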
Supplementary material: Supp Note; Supp Fig S1; Supp Tables S1-S8.

ACKNOWLEDGEMENTS

We thank Dr. Lucy Rowe, Connie Birkenmeier, and Gary Churchill for providing additional information regarding the relatedness of different inbred strains of mice used in this study. We thank Anne Morrison for DNA sample preparation. We thank Tonia Brown, Kari Augustyn and Heather Mefford for assistance in preparation of this manuscript.

Footnotes

Accession number. Microarray data have been deposited in the Gene Expression Omnibus database under accession number GSE11369.

REFERENCES

1. Cheung J, et al. Recent segmental and gene duplications in the mouse genome. Genome Biol. 2003;4:R47.
2. Bailey JA, Church DM, Ventura M, Rocchi M, Eichler EE. Analysis of segmental duplications and genome assembly in the mouse. Genome Res. 2004;14:789–801.
3. Bailey JA, Eichler EE. Genome-wide detection of segmental duplication within mammalian organisms. In: Ebert J, editor. Proceedings of the 68th Cold Spring Harbor Symposium: Genome of Homo sapiens. Cold Spring Harbor Press; New York: 2003.
4. She X, et al. Shotgun sequence assembly and recent segmental duplications within the human genome. Nature. 2004;431:927–30.
5. She X, et al. A preliminary comparative analysis of primate segmental duplications shows elevated substitution rates and a great-ape expansion of intrachromosomal duplications. Genome Res. 2006;16:576–83.
6. Sainz J, et al. Segmental duplication density decrease with distance to human-mouse breaks of synteny. Eur J Hum Genet. 2006;14:216–21.
7. Li J, et al. Genomic segmental polymorphisms in inbred mouse strains. Nat Genet. 2004;36:952–4.
8. Snijders AM, et al. Mapping segmental and sequence variations among laboratory mice using BAC array CGH. Genome Res. 2005;15:302–11.
9. Graubert TA, et al. A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet. 2007;3:e3.
10. Egan CM, Sridhar S, Wigler M, Hall IM. Recurrent DNA copy number variation in the laboratory mouse. Nat Genet. 2007;39:1384–9.
11. Watkins-Chow DE, Pavan WJ. Genomic copy number and expression variation within the C57BL/6J inbred mouse strain. Genome Res. 2008;18:60–6.
12. Bailey JA, Eichler EE. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat Rev Genet. 2006;7:552–64.
13. Nguyen DQ, Webber C, Ponting CP. Bias of selection on human copy-number variants. PLoS Genet. 2006;2:e20.
14. Armengol L, Pujana MA, Cheung J, Scherer SW, Estivill X. Enrichment of segmental duplications in regions of breaks of synteny between the human and mouse genomes suggest their involvement in evolutionary rearrangements. Hum Mol Genet. 2003;12:2201–8.
15. Bailey JA, Baertsch R, Kent WJ, Haussler D, Eichler EE. Hotspots of mammalian chromosomal evolution. Genome Biol. 2004;5:R23.
16. Armengol L, et al. Murine segmental duplications are hot spots for chromosome and gene evolution. Genomics. 2005;86:692–700.
17. Bailey JA, et al. Recent segmental duplications in the human genome. Science. 2002;297:1003–7.
18. Waterston R. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–62.
19. Bailey JA, Liu G, Eichler EE. An Alu transposition model for the origin and expansion of human segmental duplications. Am J Hum Genet. 2003;73:823–34.
20. Jurka J, Kohany O, Pavlicek A, Kapitonov VV, Jurka MV. Duplication, coclustering, and selection of human Alu retrotransposons. Proc Natl Acad Sci U S A. 2004;101:1268–72.
21. Sharp AJ, et al. Segmental duplications and copy-number variation in the human genome. Am J Hum Genet. 2005;77:78–88.
22. Tuzun E, et al. Fine-scale structural variation of the human genome. Nat Genet. 2005;37:727–32.
23. Perry GH, et al. Hotspots for copy number variation in chimpanzees and humans. Proc Natl Acad Sci U S A. 2006;103:8006–11.
24. Selzer RR, et al. Analysis of chromosome breakpoints in neuroblastoma at sub-kilobase resolution using fine-tiling oligonucleotide array CGH. Genes Chromosomes Cancer. 2005;44:305–19.
25. Reynard LN, et al. Expression analysis of the mouse multi-copy X-linked gene Xlr-related, meiosis-regulated (Xmr), reveals that Xmr encodes a spermatid-expressed cytoplasmic protein, SLX/XMR. Biol Reprod. 2007;77:329–35.
26. Kerr SM, Taggart MH, Lee M, Cooke HJ. Ott, a mouse X-linked multigene family expressed specifically during meiosis. Hum Mol Genet. 1996;5:1139–48.
27. Brennan PA, Zufall F. Pheromonal communication in vertebrates. Nature. 2006;444:308–15.
28. Sharrow SD, Vaughn JL, Zidek L, Novotny MV, Stone MJ. Pheromone binding by polymorphic mouse major urinary proteins. Protein Sci. 2002;11:2247–56.
29. Endrizzi MG, Hadinoto V, Growney JD, Miller W, Dietrich WF. Genomic sequence analysis of the mouse Naip gene array. Genome Res. 2000;10:1095–102.
30. Wright EK, et al. Naip5 affects host susceptibility to the intracellular pathogen Legionella pneumophila. Curr Biol. 2003;13:27–36.
31. Lee SH, et al. Susceptibility to mouse cytomegalovirus is associated with deletion of an activating natural killer cell receptor of the C-type lectin superfamily. Nat Genet. 2001;28:42–5.
32. IHGSC. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
33. Medstrand P, van de Lagemaat LN, Mager DL. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res. 2002;12:1483–95.
34. Redon R, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–54.
35. Kidd JM, et al. Mapping and sequencing of structural variation from eight human genomes. Nature. 2008;453:56–64.
36. Beck JA, et al. Genealogies of mouse inbred strains. Nat Genet. 2000;24:23–5.