Nat Genet. Author manuscript; available in PMC Apr 29, 2009.
Published in final edited form as:
Published online Nov 22, 2006. doi:  10.1038/ng1921
PMCID: PMC2674632
EMSID: UKMS4544

Genome assembly comparison identifies structural variants in the human genome

Abstract

Numerous types of DNA variation exist, ranging from SNPs to larger structural alterations such as copy number variants (CNVs) and inversions. Alignment of DNA sequence from different sources has been used to identify SNPs1,2 and intermediate-sized variants (ISVs)3. However, only a small proportion of total heterogeneity is characterized, and little is known of the characteristics of most smaller-sized (<50 kb) variants. Here we show that genome assembly comparison is a robust approach for identification of all classes of genetic variation. Through comparison of two human assemblies (Celera's R27c compilation and the Build 35 reference sequence), we identified megabases of sequence (in the form of 13,534 putative non-SNP events) that were absent, inverted or polymorphic in one assembly. Database comparison and laboratory experimentation further demonstrated overlap or validation for 240 variable regions and confirmed >1.5 million SNPs. Some differences were simple insertions and deletions, but in regions containing CNVs, segmental duplication and repetitive DNA, they were more complex. Our results uncover substantial undescribed variation in humans, highlighting the need for comprehensive annotation strategies to fully interpret genome scanning and personalized sequencing projects.

The most sensitive method for identifying all variation existing between two DNA donors is through direct comparison of accurately completed sequence assemblies of the genomes under study. For the human genome, there are two assembly products, one from the International Human Genome Sequencing Consortium (IHGSC)4 and another from Celera Genomics5, which used primarily clone-based sequencing and whole-genome shotgun sequencing, respectively. Although these assemblies have been evaluated for content and quality6-10, little effort has been made to make use of their differences to annotate new sequence variants.

As both assemblies represent mosaics of different donor DNA sources (with neither being fully completed), they are not the ideal substrate for comparison, but we show that much valuable data can be extracted. Our premise was to perform a thorough comparison between Celera's most complete assembly and the IHGSC reference sequence, herein called R27c (WGA2) and National Center for Biotechnology Information (NCBI) Build 35, respectively. R27c contains 2,830,275,312 bp in 14,071 scaffolds, and Build 35 contains 3,094,710,260 bp with an estimated 345 annotated gaps.

Although R27c contains some Build 35 sequences8, for this study we selected it over other Celera-only whole-genome shotgun assemblies (Build 35 also contains some Celera sequence). We rationalized that larger scaffolds would increase the likelihood of finding variants that might be missed using methods more sensitive to size restrictions, such as comparative genomic hybridization using arrays spotted with BAC clones (which has a lower limit of detection of ~50 kb)11 or fosmid-end sequencing (which, in current form, does not identify variants <8 kb or insertions >40 kb)3. As Build 35 is the human reference and has a higher nucleotide content than R27c, we focus our discussion on the sequences present or variable when comparing R27c with Build 35, but we also performed the reciprocal analysis.

We used MegaBLAST12 to align R27c to Build 35 and found 2,758,752,087 bp (97.5%) of matching sequence. We also used another alignment algorithm called A2Amapper8,13 (Table 1 and Supplementary Table 1). Then using the newly developed Genome Comparison Algorithm (GCA), we extracted variants between the assembly alignments. To reduce the potential for false positives owing to alignment errors, we describe only those differences found by GCA in both the MegaBLAST and A2Amapper comparisons (Table 2). We grouped these differences into five classes: (i) small sequence mismatches (including SNPs), (ii) unmatched sequences (including insertions, deletions and CNVs), (iii) copy-unmatched sequences (a subset of ii), (iv) inversions and (v) internal assembly gaps (Fig. 1 and Supplementary Tables 2-4). Any difference detected could represent actual difference between the DNA sources, an assembly artifact (computational or clone-induced) or alignment error.
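The requirement that a difference be found in both alignment comparisons can be illustrated with a minimal sketch. The text does not state how two calls were judged to be the same difference, so the rule below (same scaffold, same class, any coordinate overlap) is an assumption, and the `Difference` record and function names are hypothetical.

```python
from typing import List, NamedTuple


class Difference(NamedTuple):
    scaffold: str  # R27c scaffold identifier (hypothetical naming)
    start: int     # 0-based, inclusive
    end: int       # exclusive
    kind: str      # e.g. "unmatched" or "inversion"


def overlaps(a: Difference, b: Difference) -> bool:
    # Assumed matching rule: same scaffold, same class, any shared base.
    return (a.scaffold == b.scaffold and a.kind == b.kind
            and a.start < b.end and b.start < a.end)


def supported_by_both(megablast: List[Difference],
                      a2amapper: List[Difference]) -> List[Difference]:
    # Keep MegaBLAST-derived differences that an A2Amapper-derived
    # difference also supports; everything else is dropped.
    return [d for d in megablast if any(overlaps(d, o) for o in a2amapper)]
```

Under this rule, a difference reported by only one aligner, or reported with a different class by the other, is excluded from the final set.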

Figure 1
Overview of the different types of alignments and assembly differences extracted from the R27c and Build 35 genome assemblies. (a) Matched alignments account for the majority of the sequence. (b) Mismatches are small intra-alignment differences ≤10 ...
Table 1
Overview of alignment results comparing the Celera R27c assembly (2,830,275,312 nt) with the Build 35 assembly (3,094,710,260 nt) of the human genome sequence
Table 2
Putative genetic variation detected by GCA

In the first class of small sequence changes, we identified 1,613,458 nucleotides; 1,591,291 (98.6%) of these represented single-nucleotide differences, and the remainder (22,167 bp) represented other small changes ≤10 bp in size.

In the second class, we found 13,066 regions totaling 23,859,805 bp of unmatched sequence (average size was 1,826 bp; Supplementary Table 4). We used stringent filtering criteria to obtain this data set of putative insertion and deletion variants. In addition to removing regions lacking support from both alignment tools and all regions shorter than 50 bp, we also removed regions with a repeat content >95% and sequences that could be realigned to Build 35 using BLAT14 (with >98% match, >50% coverage). Putative insertion points in Build 35 can be assigned for unmatched sequences when they are flanked by anchored neighboring alignments. In total, we are able to assign putative insertion coordinates for 4,536 unmatched fragments into Build 35 (corresponding to 10,469,693 bp; Fig. 2 and Supplementary Table 4).
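The filtering criteria for unmatched sequence can be written as a single predicate. This is a sketch under stated assumptions: the `UnmatchedRegion` record and its field names are hypothetical, and only the thresholds (50 bp, 95% repeat content, >98% match with >50% coverage by BLAT) come from the text.

```python
from dataclasses import dataclass


@dataclass
class UnmatchedRegion:
    length_bp: int
    repeat_fraction: float       # fraction of bases masked as repeat, 0.0-1.0
    in_megablast: bool           # reported in the MegaBLAST comparison
    in_a2amapper: bool           # reported in the A2Amapper comparison
    blat_identity: float = 0.0   # best BLAT hit against Build 35, 0.0-1.0
    blat_coverage: float = 0.0   # fraction of the region covered by that hit


def passes_filters(region: UnmatchedRegion) -> bool:
    if not (region.in_megablast and region.in_a2amapper):
        return False  # lacks support from both alignment tools
    if region.length_bp < 50:
        return False  # shorter than 50 bp
    if region.repeat_fraction > 0.95:
        return False  # repeat content >95%
    if region.blat_identity > 0.98 and region.blat_coverage > 0.50:
        return False  # can be realigned to Build 35
    return True
```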

Figure 2
Genome-wide overview of insertion points of unmatched and copy-unmatched sequences present in R27c with no corresponding match to Build 35. Each bar represents an insertion point, and the length of each bar indicates the size of the unmatched fragment ...

We used BLAT to compare R27c unmatched sequences to the chimpanzee assembly (NCBI Build 1) and identified 888 fragments covering 1,713,610 bp with a high identity match (>96% identity over 50% of the query). As these sequences have been identified both in humans and chimpanzees, they should represent either insertion or deletion polymorphisms or sequences missing in the reference genome. Next, we analyzed unmatched sequences in comparison with known genes. We found 903 RefSeq15 genes that contained insertion points for the R27c unmatched sequence. In a separate analysis, we aligned all RefSeq mRNAs to both assemblies and identified 26 human mRNAs with >50 bp of coding sequence present in R27c but missing in Build 35 (some of these mRNAs spanned or extended into gaps, whereas other sequences were simply not present) (Supplementary Table 5). For example, DOCK3 has an exon mapping within a sequence inverted in Build 35. To verify that the coding sequences missing in Build 35 are indeed represented as mRNA, we amplified and sequenced the cDNA from 14 different genes in five tissues (Supplementary Table 6) and obtained the expected results.

Copy-unmatched sequences are defined as fragments >1 kb that have two or more copies in R27c with >98% identity but have fewer copies present in Build 35. Thus, these regions represent putative CNVs but could also be explained by ubiquitous segmental duplications for which only one copy is annotated in Build 35. Celera shotgun reads have been used previously to identify regions of segmental duplications16, but this approach does not assign an insertion point in the assembly for the additional copies. We identified 419 copy-unmatched fragments, which had an average size of 8.6 kb. Of these, we were able to assign an insertion point for 287 fragments. We also compared the copy-unmatched fragments with the regions previously detected by shotgun read depth analysis, and 63% overlapped.
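The definition of a copy-unmatched fragment translates into a short test. In this sketch the inputs are assumptions: each list holds the sequence identity (0.0 to 1.0) of every copy of the fragment found in one assembly, and the function name is hypothetical.

```python
from typing import Sequence


def is_copy_unmatched(length_bp: int,
                      r27c_identities: Sequence[float],
                      build35_identities: Sequence[float],
                      min_length_bp: int = 1000,
                      min_identity: float = 0.98) -> bool:
    # Count only copies above the identity cutoff in each assembly.
    r27c_copies = sum(1 for i in r27c_identities if i > min_identity)
    build35_copies = sum(1 for i in build35_identities if i > min_identity)
    # Fragment >1 kb, two or more copies in R27c, fewer copies in Build 35.
    return (length_bp > min_length_bp
            and r27c_copies >= 2
            and build35_copies < r27c_copies)
```

A fragment that passes this test is a putative CNV but, as noted above, could also be a segmental duplication with only one annotated copy in Build 35.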

The last two classes included inversions and gap sequences. We detected 47 intrascaffold inversions, and two entire scaffolds were in inverse orientation in R27c. Gaps are regions that contain Ns in the query sequence. Defining this class was necessary for sequence accounting but was not relevant for variation studies.

To validate computational predictions and test for polymorphism, we performed PCR analysis, quantitative real-time PCR or FISH. Initially, we selected 49 regions (38 unmatched regions, six inversions and five copy-unmatched regions; see Methods and Supplementary Table 6). We performed PCR for unmatched and inversion regions on a panel of 12 controls from CEPH pedigrees. We tested copy-unmatched regions by quantitative PCR on a panel of 48 controls. We found that 17 of 38 (45%) unmatched regions, one of six (17%) inversions and two of five (40%) copy-unmatched regions were polymorphic (a total of 20 of 49, or 41%), with one allele supporting each assembly. For 19 of 38 (50%) unmatched regions, the unmatched sequence was found in each sample tested. Of these, three extend into gaps in the reference assembly. For the two remaining unmatched fragments, we detected only the Build 35 sequence, indicating that these represent rare variants, R27c assembly errors or alignment artifacts. For six regions where the unmatched sequence was present in all individuals tested, we examined the genomic clone used by the IHGSC to generate the reference sequence. In three of six cases, we detected the unmatched sequence, indicating that absence in Build 35 was likely to be due to cloning or assembly problems.
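The validation rates quoted above follow directly from the reported counts; the short script below reproduces the arithmetic. Only the counts come from the text, and rounding to the nearest whole percent is assumed.

```python
# Counts reported in the text: (polymorphic regions, regions tested).
tested = {
    "unmatched": (17, 38),
    "inversion": (1, 6),
    "copy-unmatched": (2, 5),
}


def percent(part: int, whole: int) -> int:
    # Nearest whole percent.
    return round(100 * part / whole)


total_polymorphic = sum(p for p, _ in tested.values())
total_tested = sum(n for _, n in tested.values())

for label, (p, n) in tested.items():
    print(f"{label}: {p}/{n} = {percent(p, n)}%")
print(f"all classes: {total_polymorphic}/{total_tested} = "
      f"{percent(total_polymorphic, total_tested)}%")
```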

We performed FISH on three individuals using fosmid clones whose ends mapped within unmatched regions. We tested four types of regions: (i) 11 unmatched fragments with a location assigned in R27c, (ii) 21 fragments mapping to different chromosome locations in R27c and Build 35, (iii) six unanchored scaffolds with no coordinates assigned in either assembly and (iv) three scaffolds of uncertain orientation in R27c. Representative results for the first three categories are shown in Figure 3, and detailed results are summarized in Supplementary Table 7. The FISH analysis confirmed the expected mapping for seven fragments corresponding to Build 35 assembly gaps and two fragments corresponding to regions in which no gap is currently present in Build 35 (Fig. 3a–c). All FISH results for sequences assigned to different chromosomes in the two assemblies and for those with no coordinates assigned showed hybridization to multiple locations (Fig. 3d–f), often including centromere regions. The majority of these also demonstrated differences either in intensity or localization of hybridization signals between individuals. We experimentally verified that three scaffolds of uncertain orientation in R27c supported the orientation in Build 35 (Supplementary Table 7).

Figure 3
Fosmid probes were used for FISH experiments to confirm the R27c mapping of unmatched sequences to Build 35 or to find a location for sequences with inconsistent or no mapping information. (a) Unmatched region, with no gap in Build 35. The human COPG2 ...

We further assessed putative variants between assemblies by comparison with other data sources (Supplementary Tables 8 and 9). First, we found 1,521,291 of 1,591,291 (95.6%) single-nucleotide mismatches to be present in dbSNP; 840,802 of these were Celera-based SNPs, whereas the others were from different projects. We compared the unmatched and copy-unmatched categories with entries in the Database of Genomic Variants17. We found 331 CNVs to contain insertion points for unmatched regions and 55 CNVs with insertion points for copy-unmatched regions. Limiting the analysis to unmatched and copy-unmatched fragments >10 kb yielded support for 53 CNVs. Using data from ref. 18, we correlated the 913 CNVs detected by the whole-genome tile path clone array with unmatched and copy-unmatched sequences. We found a significant correlation, with 254 CNVs overlapping unmatched insertion points and 74 CNVs overlapping copy-unmatched sequences (P < 0.0001 for both). Of these, 88 unmatched and 13 copy-unmatched regions were >10 kb, indicating that they may explain the CNV detected. We also assessed the overlap with variants identified by the fosmid end-pair mapping approach and found support for 23 insertions identified in ref. 3. Comparison of the entire unmatched data set (including those with repeat content >95%) with the dbRIP retrotransposon polymorphism database19 yielded support for another 54 polymorphic regions, all of which corresponded to single short interspersed nuclear elements (SINE) or long interspersed nuclear elements (LINE). We compared the 49 inversions with entries in the Database of Genomic Variants17; 12 corresponded to previously identified inversion polymorphisms.
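Counting the CNVs that contain an insertion point is an interval lookup. The sketch below shows one way to do it with sorted positions and binary search; the tuple layouts and the function name are assumptions, and it does not reproduce the significance test, which the text does not describe.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Cnv = Tuple[str, int, int]   # (chromosome, start, end), end inclusive
Point = Tuple[str, int]      # (chromosome, insertion position)


def cnvs_with_insertion_points(cnvs: List[Cnv],
                               points: List[Point]) -> List[Cnv]:
    # Group insertion positions by chromosome and sort them once.
    by_chrom: Dict[str, List[int]] = {}
    for chrom, pos in points:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    hits = []
    for chrom, start, end in cnvs:
        positions = by_chrom.get(chrom, [])
        # First insertion point at or after the CNV start.
        i = bisect_left(positions, start)
        if i < len(positions) and positions[i] <= end:
            hits.append((chrom, start, end))
    return hits
```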

A total of 3,246,015 bp of R27c sequence extended into Build 35 gaps, and 1,110/4,536 (24.5%) unmatched fragments and 174/287 (60.6%) copy-unmatched sequences had insertion points in annotated segmental duplications. We also noted a strong association of insertions with segmental duplications for regions detected by the fosmid-end mapping approach3.

Alternate sequence assemblies have been created previously for specific subregions of the human genome20-22, facilitating the understanding of chromosome architecture. The results presented here confirm that whole-genome assembly comparison is the most sensitive way of identifying all types of genetic variation and that there is no limit to the size of the variants found. We provide experimental evidence that >40% of predicted unmatched regions can be confirmed experimentally and that many others correspond to known variable regions. As the current study is limited to two genome assemblies, most genetic variants will be represented by the major allele. Even the most conservative extrapolations, therefore, suggest that significantly more variation exists between humans than was previously estimated4,5. Moreover, we show that alternate assemblies can be used to contribute to the generation of a more complete reference sequence.

As an era of personalized sequencing approaches23-25, our results emphasize that developing effective strategies for extracting the most relevant data will rely on a comprehensive understanding of the content of both test and comparator sequences.

METHODS

Assemblies and alignment algorithm

Build 35 sequences were downloaded from NCBI. Sequences for the R27c assembly were obtained from Celera but are also publicly available from NCBI with accession number AADB02000000. For detailed information on the Genome Comparison Algorithm, see Supplementary Methods. Briefly, chromosome sequence assemblies from NCBI Build 35 were compared with all scaffold sequences from R27c using MegaBLAST12. The resulting alignments were converted to GFF3 format, recording detailed alignment information. GFF3 records were sorted by raw score. A greedy algorithm applied to the GFF3 records preferentially selected optimal alignments (that is, alignments with the highest raw score), eliminated suboptimal alignments and created a nonredundant set of nonoverlapping alignments by cutting GFF3 alignment records. Unmatched sequence was determined by identifying intervening sequence between alignment records. Copy-unmatched sequence was determined by searching among suboptimal alignments for sequence already matched in one assembly, but not the other, using a cutoff of 1 kb in length and >98% sequence identity. Inversions were determined by identifying alignments whose orientation differed from that of adjacent alignments. The complete data set is available upon request.
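The selection and extraction steps outlined above can be sketched in a few functions. This is not the Genome Comparison Algorithm itself, which is described in the Supplementary Methods. It is a simplified illustration that works in scaffold coordinates only and discards an overlapping lower-scoring alignment rather than cutting it. The `Alignment` record and all function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Alignment(NamedTuple):
    start: int     # scaffold start, 0-based, inclusive
    end: int       # scaffold end, exclusive
    strand: str    # "+" or "-"
    score: float   # raw alignment score


def greedy_select(alignments: List[Alignment]) -> List[Alignment]:
    # Visit alignments from highest to lowest raw score and keep each
    # one that does not overlap an alignment already kept.
    # Simplification: overlapping records are dropped, not cut.
    chosen: List[Alignment] = []
    for aln in sorted(alignments, key=lambda a: a.score, reverse=True):
        if all(aln.end <= c.start or aln.start >= c.end for c in chosen):
            chosen.append(aln)
    return sorted(chosen, key=lambda a: a.start)


def unmatched_intervals(chosen: List[Alignment],
                        scaffold_length: int) -> List[Tuple[int, int]]:
    # Intervening sequence between consecutive selected alignments.
    gaps, cursor = [], 0
    for aln in chosen:
        if aln.start > cursor:
            gaps.append((cursor, aln.start))
        cursor = max(cursor, aln.end)
    if cursor < scaffold_length:
        gaps.append((cursor, scaffold_length))
    return gaps


def candidate_inversions(chosen: List[Alignment]) -> List[Alignment]:
    # Interior alignments whose strand differs from both neighbours
    # (an assumed reading of "different from adjacent alignments").
    return [chosen[i] for i in range(1, len(chosen) - 1)
            if chosen[i].strand != chosen[i - 1].strand
            and chosen[i].strand != chosen[i + 1].strand]
```

Sorting once by score and testing each candidate against the kept set is quadratic in the worst case; an interval tree would be the usual choice at whole-genome scale.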

Correlation with genomic features

Analyses of correlations with genomic features were performed using standard data sets. The RefSeq gene set and RefSeq mRNAs15 were downloaded from NCBI. Information about CNVs was retrieved from the Database of Genomic Variants (http://projects.tcag.ca/variation/). Coordinates for segmental duplications were extracted from the Human Genome Segmental Duplication Database (http://projects.tcag.ca/humandup/)26, and whole-genome shotgun sequence detection (WSSD) regions and gap coordinates were downloaded from the UCSC Human Genome Browser27. The repeat content of unmatched sequences was determined using RepeatMasker (A.F.A. Smit, R. Hubley & P. Green, Institute for Systems Biology, Seattle; see http://www.repeatmasker.org). Detailed information regarding the genomic feature overlap analyses is given in Supplementary Methods.

PCR reactions

PCR experiments were designed as previously described for unmatched regions3 and inversions28, with reagents and optimization criteria as in ref. 28. In order to enrich for potential polymorphisms and simplify experimental design, a number of selection criteria were applied for identification of candidate regions. Regions chosen for experimental validation were all intrascaffold sequences, with <50% repeat content and <50% repeat content in the 1 kb on each side flanking the insertion point. Regions where other assembly differences mapped immediately adjacent to the insertion point in Build 35 were also avoided. See Supplementary Table 6 for primer sequences.

Quantitative PCR

Variability in copy number in copy-unmatched regions was tested by quantitative real-time PCR, using probes from the Human Universal Probe Library (Roche Diagnostics). The change in copy number was calculated as previously described29. DNA from a total of 48 unrelated HapMap individuals of European ancestry was used to assess the existence of variability in copy number. We used 20 ng of total DNA in each of two replicates. Replicates with a variation coefficient over 4% were discarded. The myoglobin B IX gene was used as the reference for the relative quantifications. See Supplementary Table 6 for primer sequences.
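The relative quantification model of ref. 29 and the replicate filter can be sketched as follows. The text does not say which measurement the coefficient of variation was computed on or which standard-deviation estimator was used, so applying it to replicate crossing-point values with the sample standard deviation is an assumption, as are the function and parameter names.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    # Sample standard deviation divided by the mean (assumed estimator).
    return stdev(values) / mean(values)


def replicates_pass(values: Sequence[float], max_cv: float = 0.04) -> bool:
    # Replicates with a variation coefficient over 4% are discarded.
    return coefficient_of_variation(values) <= max_cv


def pfaffl_ratio(e_target: float, e_ref: float,
                 cp_target_control: float, cp_target_sample: float,
                 cp_ref_control: float, cp_ref_sample: float) -> float:
    # Relative quantity of the target in the sample versus the control,
    # normalized to the reference gene:
    # ratio = E_target ** dCP_target / E_ref ** dCP_ref,
    # where dCP = CP(control) - CP(sample) and E is the amplification
    # efficiency (2.0 means perfect doubling per cycle).
    return (e_target ** (cp_target_control - cp_target_sample)
            / e_ref ** (cp_ref_control - cp_ref_sample))
```

With perfect efficiencies for both amplicons and an unchanged reference gene, a target that crosses threshold one cycle earlier in the sample gives a ratio of 2.0.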

FISH and probe selection

Fosmid clones were used as probes for FISH experiments. Fosmid end-sequences from NCBI (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/FOSMIDS/) were downloaded and aligned to the R27c and Build 35 genome assemblies using BLAT. Best unique matches were retrieved, and fosmids that mapped only to the Celera genome, fosmids with discrepancies in span size between assemblies and fosmids with best reciprocal matches found on different chromosomes were all recorded and compared with the GCA output. All FISH experiments were performed on three samples using standard protocols as described previously17,28,30. Five to ten selected metaphases were examined using fluorescence microscopy, analyzed and imaged. Three-color interphase FISH for inversion testing was performed as previously described30.

Supplementary Material

Supplementary methods

Supplementary Table 1

Supplementary Table 2

Supplementary Table 3

Supplementary Table 4

Supplementary Table 5

Supplementary Table 6

Supplementary Table 7

Supplementary Table 8

Supplementary Table 9

ACKNOWLEDGMENTS

We thank T. Tang, L. Wong, J. Wittnam, C.-F. Chu and W. Hwang of The Centre for Applied Genomics for technical assistance. Computational analyses were supported by the Shared Hierarchical Academic Research Computing Network (SHARCNET) and the Centre for Computational Biology at the Hospital for Sick Children. The work was supported by Genome Canada/Ontario Genomics Institute, the Canadian Institutes of Health Research (CIHR), the Canada Foundation for Innovation and the McLaughlin Centre for Molecular Medicine (all to S.W.S). L.A. and X.E. are supported by Genoma España and Genome Canada joint R+D+I projects and by the Generalitat de Catalunya (Departament d'Universitats, 2005SGR00008, and Departament de Salut). L.F. is supported by CIHR. S.W.S. is an Investigator of CIHR and International Scholar of Howard Hughes Medical Institute.

Footnotes

COMPETING INTERESTS STATEMENT: The authors declare that they have no competing financial interests.

Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/

References

1. Marth GT, et al. A general approach to single-nucleotide polymorphism discovery. Nat. Genet. 1999;23:452–456. [PubMed]
2. Tsui C, et al. Single nucleotide polymorphisms (SNPs) that map to gaps in the human SNP map. Nucleic Acids Res. 2003;31:4910–4916. [PMC free article] [PubMed]
3. Tuzun E, et al. Fine-scale structural variation of the human genome. Nat. Genet. 2005;37:727–732. [PubMed]
4. Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. [PubMed]
5. Venter JC, et al. The sequence of the human genome. Science. 2001;291:1304–1351. [PubMed]
6. Myers EW, Sutton GG, Smith HO, Adams MD, Venter JC. On the sequencing and assembly of the human genome. Proc. Natl. Acad. Sci. USA. 2002;99:4145–4146. [PMC free article] [PubMed]
7. Adams MD, Sutton GG, Smith HO, Myers EW, Venter JC. The independence of our genome assemblies. Proc. Natl. Acad. Sci. USA. 2003;100:3025–3026. [PMC free article] [PubMed]
8. Istrail S, et al. Whole-genome shotgun assembly and comparison of human genome assemblies. Proc. Natl. Acad. Sci. USA. 2004;101:1916–1921. [PMC free article] [PubMed]
9. Waterston RH, Lander ES, Sulston JE. On the sequencing of the human genome. Proc. Natl. Acad. Sci. USA. 2002;99:3712–3716. [PMC free article] [PubMed]
10. Waterston RH, Lander ES, Sulston JE. More on the sequencing of the human genome. Proc. Natl. Acad. Sci. USA. 2003;100:3022–3024. [PMC free article] [PubMed]
11. Feuk L, Carson AR, Scherer SW. Structural variation in the human genome. Nat. Rev. Genet. 2006;7:85–97. [PubMed]
12. Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 2000;7:203–214. [PubMed]
13. Mobarry C, Sutton G. An assembly-to-assembly comparison tool; Proceedings of the Third Annual RECOMB Satellite Meeting on DNA Sequencing Technologies and Computation; 2003.
14. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12:656–664. [PMC free article] [PubMed]
15. Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005;33:D501–D504. [PMC free article] [PubMed]
16. Bailey JA, et al. Recent segmental duplications in the human genome. Science. 2002;297:1003–1007. [PubMed]
17. Iafrate AJ, et al. Detection of large-scale variation in the human genome. Nat. Genet. 2004;36:949–951. [PubMed]
18. Redon R, et al. Global variation in copy number in the human genome. Nature. in the press.
19. Wang J, et al. dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum. Mutat. 2006;27:323–329. [PMC free article] [PubMed]
20. Hillier LW, et al. The DNA sequence of human chromosome 7. Nature. 2003;424:157–164. [PubMed]
21. Scherer SW, et al. Human chromosome 7: DNA sequence and biology. Science. 2003;300:767–772. [PMC free article] [PubMed]
22. Schmutz J, et al. The DNA sequence and comparative analysis of human chromosome 5. Nature. 2004;431:268–274. [PubMed]
23. Shendure J, Mitra RD, Varma C, Church GM. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet. 2004;5:335–344. [PubMed]
24. Bennett ST, Barnes C, Cox A, Davies L, Brown C. Toward the 1,000 dollars human genome. Pharmacogenomics. 2005;6:373–382. [PubMed]
25. Service RF. Gene sequencing. The race for the $1000 genome. Science. 2006;311:1544–1546. [PubMed]
26. Cheung J, et al. Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol. 2003;4:R25. [PMC free article] [PubMed]
27. Kent WJ, et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. [PMC free article] [PubMed]
28. Feuk L, et al. Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLoS Genet. 2005;1:e56. [PMC free article] [PubMed]
29. Pfaffl MW. A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res. 2001;29:e45. [PMC free article] [PubMed]
30. Osborne LR, et al. A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nat. Genet. 2001;29:321–325. [PMC free article] [PubMed]