Copyright © 2005, Cold Spring Harbor Laboratory Press

A genome-wide survey of structural variation between human and chimpanzee

1 Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA
2 Case Western Reserve University School of Medicine, Department of Genetics, Cleveland, Ohio 44106, USA
3 Sezione di Genetica, DAPEG, University of Bari, 70126 Bari, Italy
4 Howard Hughes Medical Institute, Seattle, Washington 98195, USA
5 Corresponding author. E-mail eee@gs.washington.edu; fax (206) 685-7301.

Received June 24, 2005; Accepted August 22, 2005.

Abstract

Structural changes (deletions, insertions, and inversions) between human and chimpanzee genomes have likely had a significant impact on lineage-specific evolution because of their potential for dramatic and irreversible mutation. The low-quality nature of the current chimpanzee genome assembly precludes the reliable identification of many of these differences. To circumvent this, we applied a method to optimally map chimpanzee fosmid paired-end sequences against the human genome to systematically identify sites of structural variation ≥12 kb between the two species. Our analysis yielded a total of 651 putative sites of chimpanzee deletion (n = 293), insertions (n = 184), and rearrangements consistent with local inversions between the two genomes (n = 174). We validated a subset (19/23) of insertion and deletions using PCR and Southern blot assays, confirming the accuracy of our method. The events are distributed throughout the genome on all chromosomes but are highly correlated with sites of segmental duplication in human and chimpanzee. These structural variants encompass at least 24 Mb of DNA and overlap with >245 genes. Seventeen of these genes contain exons missing in the chimpanzee genomic sequence and also show a significant reduction in gene expression in chimpanzee.
Compared with the pioneering work of Yunis, Prakash, Dutrillaux, and Lejeune, this analysis expands the number of potential rearrangements between chimpanzees and humans 50-fold. Furthermore, this work prioritizes regions for further finishing in the chimpanzee genome and provides a resource for interrogating functional differences between humans and chimpanzees.

Sites of structural variation (SVs) have considerable potential to impart both functional and irreversible differences between evolving species. In particular, the whole or partial deletion of genes has been proposed as one of the primary forces responsible for human evolution (Olson 1999). While cytogenetic comparisons of human and chimpanzee karyotypes have been effective in detecting large-scale (>5 Mb) SVs (Lejeune et al. 1973; Dutrillaux 1980; Yunis et al. 1980; Yunis and Prakash 1982), they are insensitive to submicroscopic changes. At the sequence level, single-base-pair nucleotide substitutions have been surveyed between these primate genomes and estimated to account for a 1.2% nucleotide difference between humans and chimpanzees (Kumar and Hedges 1998; Eichler et al. 2004b; The Chimpanzee Sequencing and Analysis Consortium 2005). The extent of variation affecting sequences larger than a few kb but too small to identify cytogenetically (<~5 Mb) has been difficult to resolve strictly by cytogenetic, microarray-based, or sequence-based methods. Comparative primate studies of segmental duplications (Jackson et al. 1999; Johnson et al. 2001; Stankiewicz et al. 2001; Samonte and Eichler 2002; Horvath et al. 2003) as well as comparisons between finished chimpanzee and human BACs (Britten 2002; Liu et al. 2003) suggest that such variation is common between the species. To date, however, there has been no systematic whole-genome assessment of such variation.
Assessment of these intermediate-sized insertion, deletion, and inversion events is critical as variation of this size has great potential to affect the structure and genic complement in each species (Albertson et al. 2000; Snijders et al. 2001; Stankiewicz et al. 2001, 2004; Enard et al. 2002; Locke et al. 2003a,b, 2005; Lupski 2004; Sharp et al. 2005; Tuzun et al. 2005). An understanding of such structural and functional differences is required to provide a more balanced perspective of the seemingly disparate phenotypic differences that distinguish humans and our closest primate relatives. Structural variation can lead to duplication or deletion of sequence elements, thereby creating species-specific exons, genes, or regulatory regions. A comparison of the mouse and human genome estimated that as much as 400 Mb of genetic material has been deleted in the mouse genome since the divergence of these two mammals 70–90 million years ago (Mouse Genome Sequencing Consortium (MGSC) 2002). The rate and impact of deletion/insertion/inversion between more closely related species has not been systematically addressed. Moreover, both gene duplication and gene loss have been proposed as important forces driving the evolution of the human lineage, but the relative importance of each with respect to human evolution has not been established (Ohno 1970; Olson 1999; Samonte and Eichler 2002). A complete catalog of all structural variation between humans and chimpanzees provides the framework to enable a better evaluation of the relative importance of each process. The whole genome shotgun sequencing method (WGS) used for construction of the current chimpanzee assembly does not allow reliable detection of structural variation for two reasons. First, the current chimpanzee assembly contains a gap, on average, once every 8 kb (The Chimpanzee Sequencing and Analysis Consortium 2005). 
Second, the chimpanzee genome is still in draft form and, thus, contains many errors where the sequence has been fragmented, misassembled, or collapsed (The Chimpanzee Sequencing and Analysis Consortium 2005). Both gaps and improper assembly can create artifacts in pairwise genome alignments leading to unacceptable false discovery rates. Various attempts to identify a subset of chimpanzee deletions using the chimpanzee draft assembly have been made (The Chimpanzee Sequencing and Analysis Consortium 2005), including an analysis that characterized deletions (>15 kb in size) based on paired-end sequence analysis. A systematic analysis that considers insertions, deletions, and inversions, however, has not been performed. Recently we developed a method for the systematic characterization of intermediate-sized structural variation (ISV) by optimal placement of fosmid paired-end sequences against the human genome reference sequence (Tuzun et al. 2005). The power of this approach stems from the stability and packaging constraints of the fosmid vector. These properties result in both genomic fidelity of inserts as well as a tight distribution of insert size around the mean. Given sufficient coverage, the presence of multiple fosmid pairs discordant by size or by orientation provides a useful metric to identify sites of structural variation. This method has been used to reliably identify insertions, deletions, and inversions between a single human individual and the human reference assembly with high (>8 kb) resolution (Tuzun et al. 2005). In this study, we perform a similar analysis in which we initially ignore the chimpanzee genome assembly and instead use a library of chimpanzee fosmid end sequences to compare the genome of a single chimpanzee individual against the human reference sequence. During the chimpanzee genome sequencing project, ~1.8 million fosmids were end-sequenced, providing ~10-fold physical coverage of the genome. 
Because the forward and reverse sequence reads from each fosmid are physically linked in the chimpanzee genome, and capillary sequencing has essentially eliminated tracking errors, placement of these reads to the high-quality finished human assembly provides comparable power to detect structural variation between the two species (Eichler et al. 2004a; Tuzun et al. 2005). Implementation of this approach with chimpanzee data allowed us to double the number of putative large deletions (>12 kb) and provide one of the first comprehensive maps of structural variation between the two genomes.

Results

We initially mapped ~1.8 million high-quality paired-end sequence reads from the chimpanzee fosmid library against the finished human genome reference sequence to identify discrepant regions (putative ISVs). To reduce the effect of sequencing errors, each fosmid end-sequence read was rescored based on trace quality, and only fosmids with high-quality reads (Phred ≥30) were retained for mapping (see Methods). In addition, during mapping we selected reads that unambiguously represented the “best match” for a particular region of the human genome. This “best match” criterion biased our set of mapped fosmid paired-end reads to regions where there was sufficient sequence divergence to unambiguously discern orthology, excluding many duplicated regions. We further excluded 137,110 clones either with sequence at only one end or with duplicated entries. Using these criteria we successfully mapped 976,000 (55%) of the ~1.8 million chimpanzee fosmid sequences on the human assembly. These mapped pairs represent ~20 Gb of DNA and therefore span ~6.8× physical coverage of the genome (see Methods).
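The coverage figure above is a simple ratio, and a back-of-envelope check can be sketched as follows. The ~20 Gb of spanned DNA comes from the text; the genome length used as the denominator is not stated in this section, so the 2.9 Gb value below is an assumption, and the function name is illustrative only.

```python
def physical_coverage(spanned_bp: float, genome_bp: float) -> float:
    """Fold physical coverage: bases spanned by mapped clone inserts / genome length."""
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    return spanned_bp / genome_bp


SPANNED_BP = 20e9  # ~20 Gb spanned by mapped fosmid pairs (from the text)
GENOME_BP = 2.9e9  # ASSUMPTION: denominator is not given in the text

coverage = physical_coverage(SPANNED_BP, GENOME_BP)
print(f"~{coverage:.1f}x physical coverage")
```

With this assumed genome length the ratio lands close to the ~6.8× quoted above; a slightly larger denominator would reproduce it exactly.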
Putative ISVs were identified by mapping each pair of chimpanzee fosmid end sequences to the human genome and recording locations where the distance between the two ends in the human assembly was “larger” or “smaller” than expected, based on the average span of mapped fosmid insert sizes across the genome as a whole (Fig. 1A).

For the purpose of this study, we operationally defined all discordant sites with respect to the chimpanzee genome. Regions in which two or more fosmid pairs spanned >49.5 kb on the human assembly were classified as “chimpanzee deletions.” Similarly, regions in which multiple fosmid pairs mapped too closely (<24.9 kb) on the human reference genome were termed “chimpanzee insertions.” It should be noted, however, that such events could also, in principle, represent human-specific insertions and deletions, respectively (see below). These thresholds allowed us to detect putative insertion/deletion events >12 kb in size. All regions were graphically visualized (parasight software) and hand-curated based on additional criteria (see Methods).

Chimpanzee deletion events

We initially identified ~550 putative “chimpanzee deletions,” where two or more independent chimpanzee fosmid pairs predicted an insert size >49.5 kb when compared with the human genome (Fig. 1B).
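The size and orientation criteria described above can be sketched as a small classifier for a single fosmid pair. The 49.5 kb and 24.9 kb cutoffs are taken from the text. The observation that they coincide with the reported insert-size distribution (37.2 ± 4.1 kb) taken as mean ± 3 SD is an inference, not something the text states, and the function and argument names are hypothetical.

```python
MEAN_KB = 37.2  # mean in silico insert size reported in Methods
SD_KB = 4.1     # standard deviation reported in Methods (PTR 22)

UPPER_KB = 49.5  # "too large" cutoff quoted in the text
LOWER_KB = 24.9  # "too small" cutoff quoted in the text

# INFERENCE: the quoted cutoffs match mean +/- 3 SD to one decimal place.
assert abs((MEAN_KB + 3 * SD_KB) - UPPER_KB) < 1e-6
assert abs((MEAN_KB - 3 * SD_KB) - LOWER_KB) < 1e-6


def classify_pair(span_kb: float, ends_face_each_other: bool) -> str:
    """Label one fosmid pair by its placement on the human assembly.

    span_kb: distance between the two mapped ends on the human reference.
    ends_face_each_other: True when the two reads have the expected
    inward-pointing orientation.
    """
    if not ends_face_each_other:
        return "inversion"   # orientation is inconsistent with the reference
    if span_kb > UPPER_KB:
        return "deletion"    # chimpanzee lacks sequence present in human
    if span_kb < LOWER_KB:
        return "insertion"   # chimpanzee carries extra sequence
    return "concordant"


for span, ok in [(37.0, True), (62.0, True), (18.5, True), (37.0, False)]:
    print(span, ok, classify_pair(span, ok))
```

A single discordant pair is only a candidate; the text requires two or more independent pairs supporting the same event before a region is called.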
As a second measure of validation, and in order to assess the lineage-specificity of these events, we experimentally characterized nine chimpanzee deletion events. First, six PCR assays were designed based on flanking conserved sequences adjacent to the chimpanzee deletion such that PCR amplification would readily amplify the deleted variant (Fig. 2E As a more direct test, we designed hybridization probes specific to the deleted sequence for an additional three sites and performed Southern hybridization experiments against a primate panel of genomic DNA. All three of the experiments (chromosome 10, Fig. 2F It is unlikely that all 293 putative chimpanzee deletion regions are fixed differences between humans and all chimpanzees. SNP data suggests that ~14%–22% of single nucleotide differences between human and chimpanzee genomes are actually polymorphic within chimpanzee populations (Chen and Li 2001; Ebersberger et al. 2002). We evaluated this expectation for ISVs by examining the human sequence internal to the deletion regions (between discordant pairs and lacking concordant pair coverage) against the sequence libraries of two other western and three central chimpanzees (The Chimpanzee Sequencing and Analysis Consortium 2005). By retaining sequences of ≥95% identity to chimpanzee sequences >500 bp or more, and further requiring that ≥1000 bp of the internal coordinates of the deletion region aligned, we identified 97 (an upper bound) regions that did match sequence in at least one other chimpanzee individual. If we assume these regions are polymorphic in the chimpanzee population, it suggests that as much as 33% of the sites that vary between human and chimpanzee also vary within chimpanzee populations. However, this analysis cannot distinguish between false positives and polymorphisms and as such may be an overestimate. A second, more direct approach was to identify polymorphisms within the two haplotypes of the chimpanzee individual's genome. 
In our initial analysis we excluded deletion polymorphisms by focusing on regions that showed multiple fosmids that were discordant by size (“too large”) and the absence of sequence read data underlying the region of putative structural variant. If we eliminate the second criterion, we identify a comparable number of putative deletion regions where there is both discordancy and concordancy when compared with the human genome (n = 266). These data suggest that the ratio of fixed to polymorphic events is ~1:2 (196:363), and is much lower than similar estimates for SNPs (2:1). It is possible that these differences may be attributed to the strong association of structural variation with segmental duplications (sites of recurrent rearrangement) between the two species. We examined all 293 “chimpanzee deletions” with respect to annotation of the human genome assembly. Similar to structural variation in humans (Iafrate et al. 2004; Sebat et al. 2004; Sharp et al. 2005; Tuzun et al. 2005), the sequence between the breakpoints of 41% (120/293) of the chimpanzee deletions overlaps with human segmental duplication (SD) sequence (Supplemental Table 1). There are 10 chimpanzee deletion events whose breakpoints fall within 80 kb (the combined bounds of resolution for the results of both analyses) of the coordinates bounding human SVs (Supplemental Table 6). Among the 178 RefSeq gene regions that intersect with these deletion regions (Supplemental Table 2), we found representatives of many duplicated gene families, including drug-detoxification (glycosyltransferase family, cytochrome P450 genes), immunity (chemokine, cytokine, MLC, HLA, and defensin families), and pregnancy-related proteins. We specifically compared all possible human RefSeq exons (n = 1001) underlying these fixed sites of structural variation to both the chimpanzee genome assembly and chimpanzee WGS. 
One hundred fifty exons, corresponding to 78 RefSeq genes, matched no chimpanzee sequence with ≥50 bp of ≥95% identity, suggesting that true orthologs of these 150 exons are not present in the genome of chimpanzees. However, only two of these 150 exons showed no sequence identity to other human gene models, indicating that the majority of exons within in these SVs arise from duplicate gene families and have paralogs elsewhere in the chimpanzee genome. We tested whether these genes (n = 78) lacking exons might show an altered pattern of gene expression between the two species due potentially to altered reading frames, premature stop codons, and nonsense-mediated mRNA decay. We obtained human–chimpanzee expression data for 40 genes from a recently published microarray study from five tissues (brain, heart, liver, kidney, and testis; Khaitovich et al. 2005). Forty-two percent (17/40) of the genes showed reduced levels of expression in chimpanzee, while 15% (6/40) showed higher levels of expression in the chimpanzee (Supplemental Table 3). The remaining 17 genes did not report any significant differences in the expression assay. The number of genes (17, or 42%) with reduced chimpanzee expression was shown to be significantly (p < 0.01) higher than expected by chance from randomly sampling 40 genes from the total dataset 10,000 times (see Methods). In the majority of the cases (35/40), the probe sets map outside of the deletion region in question (Khaitovich et al. 2005). In four of the five remaining cases, the probe sets map at the periphery (<10 kb) of the predicted boundaries of the deletion. The correlation, thus, seems to be the result of lowered gene expression rather than absence of reporting probe sets. We found no evidence of relaxed selection operating on these particular genes when 1:1 orthologs were examined between human and mouse. 
For example, the average dN/dS value for a subsample of 15 genes was 0.49 and the median was dN/dS = 0.114, very similar to the median Ka/Ks value of 0.115 found for a set of ~12,000 human–mouse orthologs (MGSC 2002). Chimpanzee insertions Similar to our deletion analysis, we required two separate criteria to classify potential insertions. First, we identified regions (n = 350) marked by two or more discordant fosmids with an insert size <24.9 kb. Since true insertions would create disruptions in paired-end continuity, we searched for the presence of singletons flanking each of these sites—fosmids for which both end sequences were high-quality but only one end of the pair mapped to the human reference sequence both at this site and in the orientation of the discontinuity. From the 350 discordant regions, we identified 164 regions with at least two discordant pairs (<24.9 kb) that were also flanked by multiple singletons orientated toward the discontinuity. Because the even distribution of unambiguously mapped end sequences can be disrupted by the presence of repeats or duplications at the breakpoints of insertions, we required the presence of singletons flanking each insertion event. An example of a ~27-kb putative chimpanzee insertion event on chromosome 1 is shown in Figure 3A
Unlike deletion detection, an important caveat to our approach is that insertion events >40 kb cannot be readily captured due to packaging constraints of the fosmid cloning system. We therefore performed a separate analysis in which we identified clusters of “singletons” (see Methods) bracketing a discontinuity in clonal coverage (both discordant and concordant clones). We identified 20 additional putative chimpanzee-specific insertions in which a clone discontinuity was flanked on either side by at least two singletons. Although the precise length of these 20 insertions is unknown, this raises the total number of putative “chimpanzee insertions” to 184 (Supplemental Table 1) and identifies regions for more targeted sequence and assembly. Similar to the deletion analysis, we compared these regions to the chimpanzee assembly and confirmed 54% (100/184) of the insertions in which the chimpanzee contained >12 kb of unalignable sequence when compared with the human at that position. Seven of these corresponded to sites of chimpanzee-specific retroviral insertions (PTERV1) (Yohn et al. 2005). We tested 13 regions with PCR in human, chimpanzees, and other primates to experimentally verify putative chimpanzee insertion events. PCR assays were designed to amplify the sequence spanning the site of structural variation in human but not the longer insertion site in chimpanzee (Fig. 3E In stark contrast to the high percentage of chimpanzee deletions overlapping human SDs, the breakpoints of only 7.6% (14/184) of the chimpanzee insertions overlap with human segmental duplication sequence (Supplemental Table 1). We find three chimpanzee insertion events that map within 80 kb of the coordinates of human SVs (Supplemental Table 6). 
Only 54 of these insertion sites intersected with coordinates for human Ref-Seq genes (Supplemental Table 2), including the genes SPAG6 (important for spermatic flagellum development), SOX5 (associated with SRY function), and BARD1 (forms a heterodimer with BRCA1 required for proper apoptotic function). Thirty-three genes contained in this set were also tested for expression differences between humans and chimpanzee (Khaitovich et al. 2005). Five showed significant under-expression and four showed significant over-expression in chimpanzee (Supplemental Table 3), which was not significantly different than expected by simulation (see Methods). The remaining 24 genes showed no significant change in expression between the species. Inversions We identified 174 regions where two or more chimpanzee fosmid paired-end sequences showed an inconsistent orientation with respect to the human genome assembly (Fig. 1B
Fifteen of the events span >20 Mb of distance and, thus, if they are conventional inversions rather than inverted duplications, they should have been clearly visible at the cytogenetic level. An example of a known pericentric inversion on chromosome 12 is shown in Figure 4B.
We note a very strong association of the inversion events with the locations of human segmental duplications. Of the 174 putative inversions, 78% overlap with human SDs. Notably, the putative inversion events identified by our method also overlap with chimpanzee SDs in 112 cases (64%). As discussed, this overlap with SDs significantly decreases the ability of our method to differentiate between duplicative transposition of material and more conventional inversions such as the large pericentric events. We identified 16 chimpanzee inversion events whose breakpoints map within 80 kb of a known human SV event (Supplemental Table 6). The breakpoints of the 41 double-ended conventional inversion events overlap with the coding region from 14 RefSeq genes (Supplemental Table 2). Given that the gene structure described is based on the human reference sequence, the coding regions of these 14 genes are possibly discontinuous in the chimpanzee genome. These 14 genes include a chemokine protein and a homeobox protein as well as several zinc fingers and hypothetical proteins. Five genes also correspond to genes tested for expression differences between human and chimpanzee by Khaitovich et al. (2005) (Supplemental Table 3). However, only one of these five genes reports any hybridization signal in any of the five tissues tested, and does not show a difference in expression between the two species (Supplemental Table 3). Discussion We have performed the first genome-wide assay of intermediate-scale structural variation between humans and chimpanzees by mapping chimpanzee fosmid paired-end sequences against the human reference sequence and identifying discordant regions by size and/or orientation. The method we have developed takes advantage of the high-quality reference of the human genome assembly and properties of the fosmid cloning system. We have demonstrated its potential to characterize interspecific structural variation in the absence of a genome assembly. 
Although we limited our analysis to the human and chimpanzee genomes, our approach to detect structural variation could be readily applied to any pair of genomes for which the genetic distance is relatively short (i.e., nucleotide divergence <10%) and one of the two genomes exists as a high-quality reference. Various species, subspecies, or strains of Drosophila, yeast, or mouse could be characterized in this fashion without the need to generate independent WGS assembly for each sibling species. While this approach offers exquisite precision and resolution over other array-based approaches (Locke et al. 2003b; Fortna et al. 2004), it also suffers a number of limitations. First, proper placement of clone sequence ends requires a high-quality reference genome. Regions of incorrect assembly will yield discordant clones that represent false positives. Likewise, the human reference genome is incomplete (Eichler et al. 2004a) and sequence exists in the chimpanzee genome that is not represented in the human reference. Structural variation within these regions cannot be readily captured, leading to false negatives in the analysis. Second, this approach is expensive compared with techniques such as arrayCGH, as it requires considerable up-front investment in creating clone libraries and generating 0.3- to 0.4-fold sequence coverage of a genome. In the absence of significant cost reductions in sequencing and clone storage, it is currently not practical to apply this technique to screening large numbers of individuals. Finally, at the most stringent level, this method utilizes only those clones that map unambiguously to the reference genome, creating a significant bias against analysis of regions with recent or highly similar repeats and duplications. 
In this analysis, we identified 651 regions of putative structural variation between the human genome assembly and a single chimpanzee individual (293 chimpanzee deletions, 184 chimpanzee insertions, and 174 inversions/duplicative transpositions; Table 2). Because these data were generated from a single chimpanzee individual, as much as ~1/4 of these sites may be polymorphic within the chimpanzee population (The Chimpanzee Sequencing and Analysis Consortium 2005). Future interrogation of these sites in multiple chimpanzee individuals is required to discriminate between interspecific and intraspecific variation. Notwithstanding polymorphism, this analysis potentially increases the number of known structural variants between our two species by a factor of 50 beyond what was originally documented by cytogenetic techniques (Lejeune et al. 1973; Dutrillaux 1980; Yunis et al. 1980; Yunis and Prakash 1982). Details concerning the location of these structural variants mapped against the finished human genome may be found at http://humanparalogy.gs.washington.edu/CSV.
These data serve two purposes. First, they provide a road map of regions of structural variation for further attention during the second phase of the chimpanzee genome assembly. Many of these regions were not properly assembled in the published version of the genome and we now have identified the specific fosmid clones for further characterization. Second, our set of disrupted or deleted genes provides a resource for interrogating differences between human and chimpanzee species at a functional level.

An important question that remains unaddressed is whether deletion and insertion events are symmetric or asymmetric with respect to frequency or abundance between human and chimpanzee lineages of evolution (Olson 1999; Locke et al. 2003a,b, 2004; Fortna et al. 2004). At first blush, it may appear that chimpanzee deletions outpace insertions (1.6:1 by count or 8:1 by bp in our analysis; Supplemental Table 1). However, with the exception of a small subset (n = 20) we have not determined the lineage-specificity of the majority of the events. Additionally, it is important to note that our fosmid-based approach creates a considerable bias against detecting large (>40 kb) chimpanzee insertions versus deletions, partially explaining the differences in event numbers and base pairs involved. If we limit our analysis to events estimated to be between 12.5 and 36.5 kb, we find that the margin narrows. One hundred sixty-four chimpanzee “insertion” events (2.7 Mb) were identified in this range, compared with 174 chimpanzee “deletion” events (3.9 Mb of DNA).

At the chromosomal level, the pattern of deletions, insertions, and inversion events mapped to the human reference assembly does not indicate any obvious genome-wide bias for the location of structural variants (Fig. 5).
At the regional level, certain areas show local hotspots for one or more types of variation. For example, the probability of observing four or more insertion or deletion events within a 1-Mb region by chance is <<0.001 (see Methods), suggesting that the events we predict in three regions (on chromosome 1, chromosome 2, and chromosome 14) may represent genomic hotspots of variation (see Methods). At the sequence level, a strong association emerges with respect to segmental duplications. The strongest association is observed for chimpanzee inversions and deletions. On average, 78% and 41% of the chimpanzee inversion and deletion events, respectively, overlap with SDs despite the fact that SDs make up only 5% of the human genome. The finding that 78% of the inversions overlap human SDs is not unexpected, since our algorithm cannot distinguish between conventional inversion events and duplicative transposition of sequence material. The enrichment seen in chimpanzee deletions extends and corroborates recent findings from human variation studies that show a similar structural bias to such regions (Fredman et al. 2004; Iafrate et al. 2004; Sebat et al. 2004; Sharp et al. 2005; Tuzun et al. 2005). Surprisingly, the enrichment is much less pronounced for chimpanzee insertion events, where only 8% map to sites of segmental duplication. In total, we have identified 245 separate RefSeq genes that may be potentially affected by structural differences between chimpanzee and human (http://www.ncbi.nlm.nih.gov/RefSeq/). These 245 genes include members of a vast array of functional groups including those related to drug-detoxification, receptors, and reproduction. It is tempting to speculate that these functional groups have been good candidates for adaptive evolution since the divergence of humans and chimpanzees (Hollox et al. 2001; Gonzales et al. 2003). 
Over 71% (184/257) of these genes are in regions of chimpanzee deletions, and 78 genes contain 150 exons in humans that lack a corresponding high percent-identity match in the chimpanzee WGS sequences. More importantly, 23 of these genes show expression differences between humans and chimpanzees. Most of these genic regions are not well assembled in the current draft assembly. We recommend that such regions be prioritized for high-quality, clone-based sequencing.

In summary, our data establish the fosmid paired-end mapping strategy as a robust and accurate method for detecting mid- and large-scale structural variation between humans and other primates. This method gives a high-resolution estimate of coordinates within ~40 kb of the breakpoints of duplications, deletions, and inversions that are either too small to be detected by traditional cytogenetic analyses or too large to be reliably ascertained by comparisons of unfinished, low-quality, or low-coverage genomes to the human assembly (Pinkel et al. 1998; Snijders et al. 2001; Locke et al. 2003b). Our technique is also capable of detecting large-scale SVs and has yielded results that correspond well to all nine of the previously identified macro-inversions between humans and chimpanzee (Yunis et al. 1980; Yunis and Prakash 1982). In addition, our analysis identifies ~245 genes that are potentially rearranged or deleted between the two species. Extensive future experimental study is required to demonstrate functional significance of any genes and their role in contributing to phenotypic difference between humans and chimpanzees.

Methods

Fosmid paired-end sequence placement

During the sequencing of the chimpanzee genome, a fosmid library (CHORI-1251) was constructed from peripheral blood obtained from the male chimpanzee genome sequence donor (Clint).
The fosmid vector was chosen because of the insert stability, tight distribution of insert size, and the relatively low frequency of propagation errors when compared with other conventional cloning vectors (Kim et al. 1992). We obtained both the sequence and corresponding base quality for all traces from Washington University (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?), which yielded 1,788,428 end sequences (1,839,144,838 bp excluding “N”s) representing 866,328 nonredundant clones. Of these, we found 729,218 clones with trace sequences for both fosmid ends. All fosmid end sequences were optimally aligned and paired against both the reference human genome sequence and against chimpanzee chromosome 22 as part of a four-step process to detect putative rearrangements: (1) initial recruitment, (2) optimal realignment with quality rescoring, (3) determination of paired-end read placements, and (4) rearrangement detection.

Initial recruitment

During the recruitment phase all fosmid end sequences were aligned using NCBI Megablast (-p 80 -s 90 -v 7 -b 7 -w 12 -t 21) to the finished reference human genome assembly (build34, July 2003). The score threshold (-s 90) was set to detect all alignments of ≥150 bp and ≥90% identity. A score cutoff allowed for the flexibility to detect shorter alignments with higher similarity or longer alignments with lower sequence identity, such as those due to base-calling errors in poor-quality traces. Additionally, an 80% identity threshold (-p 80) was set to avoid recruiting numerous pairwise alignments representing related transposable/repetitive elements. To capture all truly orthologous alignments while decreasing noise associated with more recently transposed repetitive sequences, only the alignments from the top seven scoring genomic reference fragments (and up to eight alignments within each genomic fragment) were retained.
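The retention rule described above (identity and score cutoffs, then the top seven reference fragments with up to eight alignments each) can be sketched as a post-filter over a list of hits. This is not how Megablast applies its options internally; it only mirrors the outcome the text describes. The `Hit` record, the `recruit` function, and the choice to rank fragments by their best-scoring hit are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    fragment: str    # identifier of the genomic reference fragment
    score: float     # alignment score
    identity: float  # percent identity of the alignment


def recruit(hits: Iterable[Hit],
            min_identity: float = 80.0,   # mirrors -p 80
            min_score: float = 90.0,      # mirrors -s 90
            max_fragments: int = 7,
            max_per_fragment: int = 8) -> List[Hit]:
    """Keep hits passing both cutoffs, limited to the best-scoring fragments."""
    by_fragment = defaultdict(list)
    for h in hits:
        if h.identity >= min_identity and h.score >= min_score:
            by_fragment[h.fragment].append(h)

    # ASSUMPTION: a fragment is ranked by its single best-scoring hit.
    ranked = sorted(by_fragment.values(),
                    key=lambda hs: max(h.score for h in hs),
                    reverse=True)

    kept: List[Hit] = []
    for frag_hits in ranked[:max_fragments]:
        best_first = sorted(frag_hits, key=lambda h: h.score, reverse=True)
        kept.extend(best_first[:max_per_fragment])
    return kept


example = [Hit("fragA", 310.0, 98.5), Hit("fragA", 120.0, 91.0),
           Hit("fragB", 95.0, 88.0), Hit("fragC", 400.0, 72.0)]
for hit in recruit(example):
    print(hit)
```

In the example, the `fragC` hit is dropped despite its high score because its identity falls below the 80% cutoff, which is the behavior the text attributes to the repeat filter.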
In total, 698,559 of the 866,328 clones (80.6%) had trace sequence for both ends that was also high-quality at both ends (30 bases of Phred Q 30). Of these 698,559 possible clones, 689,403 had recruitment of both ends, with each end having one or more alignments. The remaining clones (~1.3%, or 9156/698,559) failed to align to human sequence at either end.

Optimal realignment with quality rescoring

All recruited alignments were then optimally realigned using an in-house Needleman-Wunsch implementation (match = +10, mismatch = -8, gap opening = -20, gap extension = -1, no penalty for terminal gaps) (Needleman and Wunsch 1970). Global realignment improved the treatment of insertions, deletions, and substitutions. The percent identity for each global alignment was then recalculated, base by base, including only those aligned bases where fosmid-end nucleotides were high-quality (Phred Q score 30, which equals a sequencing error rate of 10⁻³ per base) (Ewing and Green 1998). All reference genome sequence was considered high-quality, as published reports demonstrate extremely low error rates of <10⁻⁴ to 10⁻⁵ per base (International Human Genome Sequencing Consortium [IHGSC] 2001, 2004). A new alignment score, weighted for orthologous levels of identity, was then calculated based on the number of aligned bases and fraction identity (Global Alignment Score = base pairs × [2 × identity - 20 - [1 - identity]]). Alignments for each fosmid end were filtered to remove relatively small, lower-scoring, low-identity alignments, which are unlikely to represent orthologous locations.

Determination of best paired-end placements

We examined all pairwise combinations of end sequences that passed our criteria. In order to establish appropriate length thresholds, we initially examined the distribution of in silico insert sizes based on the mapping of 6172 chimpanzee fosmid paired-end sequences against the unique portions of human chromosome 21 and chimpanzee chromosome 22.
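The base-by-base identity recalculation described under "Optimal realignment with quality rescoring" can be sketched in Python as follows. This is a minimal illustration, not the in-house implementation: the function names and the input layout (two gapped alignment strings plus a list of Phred scores for the read) are assumptions, as is the decision to skip every column that contains a gap.

```python
def phred_error_probability(q):
    """Per-base error probability implied by a Phred score: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def quality_filtered_identity(read_aln, ref_aln, read_quals, min_q=30):
    """Fraction of identical bases among aligned columns whose read base
    has Phred quality >= min_q.

    read_aln, ref_aln: gapped alignment strings of equal length ('-' = gap).
    read_quals: Phred scores for the ungapped read bases, in read order.
    Returns None when no column qualifies.
    """
    if len(read_aln) != len(ref_aln):
        raise ValueError("alignment strings must have equal length")

    matches = compared = 0
    qi = 0  # index into read_quals; advances only on real read bases
    for r, g in zip(read_aln, ref_aln):
        if r == "-":
            continue  # no read base in this column, so no quality value
        q = read_quals[qi]
        qi += 1
        # ASSUMPTION: columns with a reference gap are not counted.
        if g == "-" or q < min_q:
            continue
        compared += 1
        if r.upper() == g.upper():
            matches += 1
    return matches / compared if compared else None
```

Low-quality read bases are dropped from both the numerator and the denominator, so a base-calling error in a poor region of the trace does not lower the reported identity.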
We determined that the insert size was tightly distributed around the mean (PTR 22: 37.2 ± 4.1 kb; HSA 22: 37.2 ± 4.3 kb). This distribution was maintained after alignment of 555,929 high-quality clones to the whole human genome (Fig. 1A).

Detection of rearrangements

Putative rearrangements were first identified computationally when two or more independent discordant fosmid clones supported the same type of rearrangement at an overlapping genomic position. Specifically, relative to the reference genome, multiple discordant fosmids supported an insertion when their insert size was too small, a deletion when the insert size was too large, and an inversion when the ends were directly oriented, rather than inverted. The minimal region containing the rearrangement on the reference genome was defined for each rearrangement by the position of the most juxtaposed/interior end sequences of the discordant clones overlapping the genomic region. For each minimal region of rearrangement, we calculated the amount of gap sequence, segmental duplication, and coverage of concordant fosmids.

We used separate secondary criteria for insertion and deletion events to reduce the rate of false positives. To verify deletions, we required a break in concordant coverage (i.e., the bases spanned by concordant clones), because continuous concordant coverage would instead support the configuration represented in the human reference sequence. We also removed 10 regions from our final set because they contained fosmids that spanned >1 Mb of DNA but failed to meet the second criterion of a sufficient gap in concordant fosmid coverage (i.e., <50% of the total length of the span). For insertion regions, we required the presence of at least two flanking singletons (clones in which one end is a best match in the human genome and the other is unaligned), in the appropriate orientation, for verification.
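The primary detection rule above (discordant span or orientation, supported by at least two clones at an overlapping position) can be sketched in Python. The sketch rests on several assumptions that are not taken from the text: the size cutoffs are set at the mean ± 3 SD of the 37.2 ± 4.3 kb distribution, the `PairedEnd` record and function names are hypothetical, the minimal region is approximated by the intersection of the supporting clone spans, and the secondary criteria (concordant coverage breaks, flanking singletons) are omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedEnd:
    """Best placement of one fosmid on the reference (hypothetical layout)."""
    clone: str
    chrom: str
    left: int          # leftmost mapped coordinate of the pair
    right: int         # rightmost mapped coordinate of the pair
    same_strand: bool  # True when both ends align to the same strand


# ASSUMPTION: cutoffs at mean +/- 3 SD; the multiplier is illustrative only.
MEAN_KB, SD_KB, N_SD = 37.2, 4.3, 3
LOWER = round((MEAN_KB - N_SD * SD_KB) * 1000)
UPPER = round((MEAN_KB + N_SD * SD_KB) * 1000)


def classify(pe, lower=LOWER, upper=UPPER):
    """Label a placed clone relative to the reference genome."""
    if pe.same_strand:
        return "inversion"   # ends directly oriented rather than inverted
    span = pe.right - pe.left
    if span < lower:
        return "insertion"   # apparent insert too small on the reference
    if span > upper:
        return "deletion"    # apparent insert too large on the reference
    return "concordant"


def call_sites(pairs, lower=LOWER, upper=UPPER, min_support=2):
    """Return (chrom, type, start, end, n_clones) for each supported site."""
    discordant = defaultdict(list)
    for pe in pairs:
        kind = classify(pe, lower, upper)
        if kind != "concordant":
            discordant[(pe.chrom, kind)].append(pe)

    sites = []
    for (chrom, kind), group in discordant.items():
        group.sort(key=lambda p: p.left)
        count, lo, hi = 1, group[0].left, group[0].right
        for pe in group[1:]:
            if pe.left < hi:   # overlaps the shared interior region
                count, lo, hi = count + 1, pe.left, min(hi, pe.right)
            else:              # close the current cluster, start a new one
                if count >= min_support:
                    sites.append((chrom, kind, lo, hi, count))
                count, lo, hi = 1, pe.left, pe.right
        if count >= min_support:
            sites.append((chrom, kind, lo, hi, count))
    return sites
```

The sketch does not check that supporting clones are independent; clonal duplicates would need to be removed before calling sites.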
Sequence annotation, discrepant clones, and putative regions of rearrangement, based on the two primary and secondary requirements described above, were displayed together for each chromosome using parasight (http://humanparalogy.gs.washington.edu/parasight/; Supplemental Figs. 3–26). During our analysis of discordant fosmids, we identified 37 regions where fosmid pairs span gaps within the sequence assembly. While such clones may be informative in directing gap closure, they are less likely to represent true sites of structural variation because of the difficulty of accurately estimating gap sizes (IHGSC 2004). During this analysis we also identified 163 sites where the beginning and end positions of two different fosmids mapped within 20 bp of one another. We conservatively classified these as library amplification events (i.e., clonal propagates) and excluded them from further analysis. After elimination of clonal propagation and other assembly artifacts, we identified 651 sites of putative structural variation, corresponding to 293 chimpanzee deletions, 184 insertions, and 174 inversions (Fig. 5).

Permutation testing and generation of random distributions

We randomly sampled 40 genes from the set of ~35,000 genes tested for expression differences between humans and chimpanzees (Khaitovich et al. 2005). We repeated this sampling procedure (n = 10,000), recording the proportion of genes showing increased or decreased expression in chimpanzee. The mean percentages for all 10,000 iterations showing under- or overexpression in chimpanzee were 22% and 14%, respectively, consistent with the entire data set (n = 35,000 genes). The proportion of genes overlapping with chimpanzee deletions showing reduced expression (17/40) was significantly increased (p < 0.003) when compared with the total gene set.
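The resampling test just described can be sketched in Python using only the standard library. The gene-level expression calls are not reproduced here, so the sketch assumes a population of exactly 35,000 genes of which 22% (7,700) show reduced expression in chimpanzee; the function name, the seed, and these counts are illustrative assumptions rather than the published data.

```python
import random


def permutation_p_value(n_total=35_000, frac_reduced=0.22, sample_size=40,
                        observed=17, iterations=10_000, seed=0):
    """Empirical one-sided p-value: the fraction of random gene samples
    containing at least `observed` genes with reduced expression."""
    rng = random.Random(seed)
    # ASSUMPTION: genes 0 .. n_reduced-1 are the reduced-expression genes.
    n_reduced = round(n_total * frac_reduced)

    as_extreme = 0
    for _ in range(iterations):
        # Draw genes without replacement, as when sampling distinct genes.
        sample = rng.sample(range(n_total), sample_size)
        reduced_in_sample = sum(1 for g in sample if g < n_reduced)
        if reduced_in_sample >= observed:
            as_extreme += 1
    return as_extreme / iterations


if __name__ == "__main__":
    print(f"empirical p = {permutation_p_value():.4f}")
```

Because the p-value is estimated from 10,000 draws, it varies slightly with the seed, and its resolution is limited to 1/10,000.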
To test the distribution of our insertion and deletion events across the genome, we randomly simulated the placement of 447 insertion and deletion events of equivalent size within the human reference assembly and recorded the distances between each event and its closest neighbor. We replicated this simulation of randomly placed events 10,000 times. We compared the distribution of these distances to the distribution of distances between closest neighbors found in the observed data set of 447 events and established a probability distribution (Poisson) for the number of events occurring by chance within a given amount of sequence.

Chimpanzee WSSD comparison

We implemented the WSSD duplication detection strategy, which measures the depth of coverage of random WGS sequence data against the human reference sequence to identify duplicated sequence in chimpanzee (>94% identity and >20 kb in length) (Cheng et al. 2005).

Microarray comparison

Gene expression differences between human and chimpanzee were assessed as described (Khaitovich et al. 2005). Briefly, five tissues (heart, brain, liver, kidney, and testis) were compared among five chimpanzee and six human individuals using Affymetrix® HG U133plus2 arrays. Eleven probes for each gene were chosen. All probes with a significant difference in hybridization efficiency between humans and chimpanzees were excluded. We first estimated the relative binding efficiency of each probe by comparing its signal intensity to the intensities of all other probes within the same probe set. We then compared the calculated binding efficiencies of the probes between all human and all chimpanzee samples using a t-test. If the binding efficiency of a probe differed significantly between human and chimpanzee samples (p < 0.001), the probe was masked.
Since this algorithm does not rely on actual sequence comparison, probes with different binding efficiencies caused by sequence differences in any copy of the gene will be masked (Khaitovich et al. 2004). Differentially expressed transcripts were defined as those that met the following criteria: (1) the corresponding probe set had to be expressed in all individuals from at least one species (detection p-value <0.065), and (2) the corresponding probe set had to show a change in expression in the same direction in all 30 pairwise comparisons. These cut-offs correspond to a false discovery rate ≤1.0% in all five tissues, estimated from 10,000 random permutations of sample labels.

PCR analyses and genomic hybridization

Oligonucleotides were designed within conserved sequence flanking sites of structural variation. PCR amplification conditions were as follows: initial denaturation for 5 min at 95°C; “touchdown” from 65°C to 55°C (60 sec 95°C, 60 sec 65°C, 60 sec 72°C, decreasing 1°C/cycle for 10 cycles); followed by 35 additional cycles of 60 sec 95°C, 60 sec 55°C, 60 sec 72°C (oligonucleotide sequences are shown in Supplemental Table 5). DNA samples (in the order of the gel): JK1051A (Homo sapiens), GM17015 (Homo sapiens), CO551 (Pan troglodytes), SFBR-4X0396 (Pan troglodytes), SFBR-4X0430 (Pan troglodytes), SFBR-4X0429 (Pan troglodytes), NG05253 (Pan paniscus), NG05251 (Gorilla gorilla), EEE-0002PPY (Pongo pygmaeus), SFBR-8320 (Papio hamadryas), NAO363446 (Macaca mulatta). For Southern hybridizations, primate DNA (human [Homo sapiens], ELGP18; common chimpanzee [Pan troglodytes], AG16618, NA03448, NA03450, and NG06939; bonobo [Pan paniscus], LB501A and LB502A; gorilla [Gorilla gorilla], EEE0001GG0 and NG05251; orangutan [Pongo pygmaeus], EEE0003PPY and EEE0004PPY) was restriction enzyme-digested, transferred to nylon membrane, and hybridized as described previously (Yohn et al. 2005).
Human PCR amplicons (see Supplemental Table 5 for PCR oligonucleotide sequence and conditions) corresponding to indels were used as radioactive probes.

RT-PCR validation of IL1F7

RNA purification was performed on peripheral blood samples according to the standard protocol of the TRIzol purification kit (Invitrogen Life Technologies, #15596-026). Synthesis of cDNA from RNA was performed according to the standard protocol of the ProtoScript cDNA synthesis kit (New England Biolabs, #E6500S). PCR amplification conditions were as follows: initial denaturation for 2 min at 94°C, followed by 35 additional cycles of 60 sec 94°C, 30 sec 60°C, 30 sec 72°C (oligonucleotide sequences shown in Supplemental Table 5). Samples tested (in order of appearance on gel): gorilla (Gorilla gorilla, 465); bonobo (Pan paniscus, LB502); human (Homo sapiens, EEE0007HSA and EEE0008HSA); and chimpanzee (Pan troglodytes, BC450 and BC449).

FISH

Fluorescent in situ hybridization was used to validate potential inversions between human and chimpanzee. Human RP11-BACs (based on end-sequence map positions within the July 2004 UCSC human genome browser) were selected corresponding to non-duplicated human sequence on either side of the inversion breakpoint (Supplemental Table 3). Metaphase and interphase nuclei were hybridized (Horvath et al. 2000), and bicolor and tricolor FISH experiments were compared between chimpanzee and human chromosomes. A disruption in continuity and order of probes between the two species was taken as evidence of a true inversion. FISH experiments that showed colinearity of the markers and additional interphase or metaphase nuclei were scored as chimpanzee segmental duplications. At least 10 metaphases were examined for each experiment, and chromosome identity was established using standard DAPI staining according to the guidelines of the International Standard for Cytogenetic Nomenclature (ISCN 1985).
Acknowledgments

The authors would like to thank Dr. Jeff Bailey, Dr. Andy Sharp, Dr. Devin Locke, and Sierra Hansen for helpful comments and discussion regarding this manuscript. Non-human primate materials used in the research were provided in part by the Southwest National Primate Research Center (P51-RR013986). We thank the Chimpanzee Sequencing and Analysis Consortium for access to the chimpanzee genome assembly, WGS trace, and finished sequence data prior to publication. We are grateful to Jerilyn Pecotte, Steve Warren, and Jeffrey Rogers for providing some of the primate material used in this study. E.E.E. is an investigator of the Howard Hughes Medical Institute. This work was supported in part by NIH grants GM58815 and HG002385 to E.E.E. In addition, the authors gratefully acknowledge CEGBA (Centro di Eccellenza Geni in campo Biosanitario e Agroalimentare), MIUR (Ministero Italiano della Università e della Ricerca; Cluster C03, Prog. L.488/92), the European Commission (INPRIMAT, QLRI-CT-2002–01325), and the BMBF (Bundesministerium für Bildung und Forschung) for financial support.

Notes

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.4338005. Article published online before print in September 2005.

Footnotes

[Supplemental material is available online at www.genome.org. The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: The Southwest National Primate Research Center, The Chimpanzee Sequencing and Analysis Consortium, Jerilyn Pecotte, Peter Parham, Steve Warren, and Jeffrey Rogers.]