Copyright © 2005, Cold Spring Harbor Laboratory Press

Hotspots of mutation and breakage in dog and human chromosomes

MRC Functional Genetics Unit, Department of Human Anatomy and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom

1Corresponding author. E-mail caleb.webber/at/human-anatomy.oxford.ac.uk; fax 44-1865-282651.

Received March 2, 2005; Accepted June 16, 2005.

Abstract

Sequencing of the dog genome allows an investigation of the location-dependent evolutionary processes that occurred since the common ancestor of primates and carnivores, ~95 million years ago. We investigated variations in G+C nucleotide fraction and synonymous nucleotide substitution rates (Ks) across dog and human genomes. Our results show that dog genes located either in subtelomeric and pericentromeric regions, or in short synteny blocks, possess significantly elevated G+C fraction and Ks values. Human subtelomeric, but not pericentromeric, genes also exhibit these elevations. We then examined 1.048 Gb of human sequence that is likely not to have been located near a primate telomere at any time since the common ancestor of dog and human. We observed that regions of highest G+C or Ks (“hotspots”; median sizes of 0.5 or 1.3 Mb, respectively) within this sequence were preferentially segregated to dog subtelomeres and pericentromeres during the rearrangements that eventually gave rise to the extant canine karyotype. Our data cannot be accounted for solely on the basis of gradually elevating G+C fractions in subtelomeric regions as a consequence of biased gene conversion. Rather, we propose that high G+C sequences are found preferentially within dog subtelomeres as a direct consequence of chromosomal fission occurring more frequently within regions elevated in G+C.

Over their evolution, genome sequences accumulate small-scale nucleotide substitutions, insertions, and deletions, and larger-scale rearrangements and translocations.
Each effect has acted differentially among genomes from diverse species, and between and within chromosomes of the same species (Wolfe et al. 1989; Matassi et al. 1999; Mouse Genome Sequencing Consortium 2002). Our understanding of the rates of, and correlations among, these evolutionary processes has greatly benefited from comparative analyses of human, mouse, and rat genomes (Mouse Genome Sequencing Consortium 2002; Rat Genome Sequencing Project Consortium 2004). However, mouse and rat genomes are highly derived with respect to that of the common ancestor of eutherian mammals (CAE), both in terms of their highly rearranged genomes, and in the relatively large number of nucleotide substitutions in selectively neutral sites they have accumulated. In contrast, the karyotype of the CAE (2n = 46) has since been altered by only 12 rearrangements and one reciprocal translocation along the primate lineage to humans (Wienberg 2004). Thus, compared with the derived karyotypes of murid rodents, the human karyotype is relatively ancestral. Chromosomal rearrangements, such as inversions and translocations, require double-stranded breaks. These were first supposed to occur randomly in chromosomes (Nadeau and Taylor 1984), although examinations of conserved synteny blocks in human and mouse chromosomes now indicate the presence of fragile regions that possess high propensities for breakage (Pevzner and Tesler 2003). Although others have challenged this conclusion (Trinh et al. 2004), the existence of fragile sites is consistent with observations that non-B DNA in human chromosomes is particularly susceptible to deletion events in human disease (Bacolla et al. 2004). Nucleotide substitution rates at neutral sites are greatly affected by their sequences' CpG content, since methylated cytosine in a CpG dinucleotide is hypermutable (Cooper and Youssoufian 1988; Sved and Bird 1990). 
These rates are also strongly correlated with the G+C fraction (Mouse Genome Sequencing Consortium 2002; Hardison et al. 2003; Fryxell and Moon 2004). The G+C fraction of isochores, long (>300 kb) regions of relatively homogeneous base composition (Filipski et al. 1973; Bernardi et al. 1985), appears to have declined significantly prior to the appearance of the first boreoeutherian animal (Belle et al. 2004), the earliest common ancestor of all primates, rodents, and carnivores (Springer et al. 2003). However, isochore G+C values in primate and carnivore lineages appear to have declined only slightly thereafter (Belle et al. 2004). In primate and rodent genomes, local rates of recombination are known to be positively correlated with both G+C fraction (Eyre-Walker 1993; Fullerton et al. 2001; International Human Genome Sequencing Consortium 2001) and neutral rates (Lercher and Hurst 2002; Mouse Genome Sequencing Consortium 2002; Hardison et al. 2003; Hellmann et al. 2003). These correlations have prompted speculation that recombination is mutagenic, and that its increase drives elevation of G+C fractions (Eyre-Walker 1993; Hardison et al. 2003; Meunier and Duret 2004), perhaps by “biased gene conversion” (BGC) (Lamb 1986; Brown and Jiricny 1987, 1988; Eyre-Walker 1993; Galtier et al. 2001; Marais 2003). In this BGC model, chromosomal locations where recombination is highest (such as human subtelomeric regions) (Kong et al. 2002) should, over time, increase their G+C contents while decreasing nucleotide substitution rates (Marais 2003). We recently observed a hitherto unforeseen effect of subtelomeric location on nucleotide substitution rates. We compared chicken gene sequences with their human orthologs in terms of Ks, the number of silent (synonymous) nucleotide substitutions per synonymous site, an estimate of the neutral substitution rate (International Chicken Genome Sequencing Consortium 2004). 
We found that the average Ks value for genes on the small avian chromosomes (“microchromosomes”) was significantly elevated compared with the average Ks value for genes on the larger chromosomes (“macrochromosomes”). We reasoned that this effect might have been driven by BGC, with a substitution rate increase for recombination-susceptible sequences close to telomeres. Such a rate increase has previously been observed for subtelomeric genes in Saccharomyces cerevisiae (Winzeler et al. 2003), although not for Caenorhabditis nematode genes (Stein et al. 2003). Indeed, we found elevated Ks values in regions <10 Mb from the ends of assembled macrochromosomes, to a level that was indistinguishable from Ks values obtained from genes in the chicken microchromosomes. We proposed that the microchromosomes' elevation in Ks values is a direct result of these chromosomes being deficient in genes that are located distant from telomeric ends. The availability of the dog (Canis familiaris) genome sequence (Lindblad-Toh et al. 2005) now enables a fresh perspective to be gained on the correlations and causal relationships between evolutionary rates, recombination, G+C fractions, and physical location. The karyotype (2n = 78) of the dog, C. familiaris, is substantially rearranged with respect to the CAE. From chromosome painting experiments of carnivores, it appears that the high-numbered acrocentric karyotypes of extant canids (2n = 74-78) arose from a fragmentation of the ancestral carnivore karyotype (2n = 42), mostly involving pericentric inversions followed by the fission of chromosomes at their centromeres (Nash et al. 2001; Wienberg 2004). The dog genome sequence presents an opportunity both to reconstruct evolutionary events on the lineage to dog and to infer with greater precision past variations in base composition and evolutionary rates. 
In particular, the derived dog karyotype allows us to test the hypothesis that sequences proximal to a telomere experience elevated nucleotide substitution rates because of higher recombination and BGC rates, thereby causing a concomitant increase in the fraction of G+C nucleotides. This hypothesis predicts that dog genes located near recently derived telomeres have accumulated considerably greater G+C content at fourfold degenerate sites than their human orthologs, which have been located far from a telomere since the last common ancestor of dog and human. Similarly, genes located in human subtelomeres would be expected to have increased their G+C fraction at fourfold degenerate sites relative to dog nonsubtelomeric orthologs. Our results show that dog genes that are located in either subtelomeric or pericentromeric regions possess elevated G+C nucleotide fractions and synonymous rates relative to dog interstitial genes. Moreover, the rank order correlation of G+C at fourfold degenerate sites of dog and human orthologs remains high despite these rearrangements. While we conclude that recombination-driven BGC has occurred in the vicinity of dog telomeres, no evidence for this was found near the more ancestral human telomeres. We considered an alternative hypothesis, that chromosome fission during the fragmentation of the canid karyotype occurred preferentially within ancestrally high G+C regions. This model provides two testable predictions: (1) that G+C bias in chromosome breakage contributed to the observed elevations in G+C fractions within dog subtelomeres and pericentromeres; and (2) that high G+C regions suffered more numerous breakages and thus were preferentially segregated to shorter synteny blocks. Each of these predictions is supported by dog and human genome comparisons. We thus propose a high G+C “fragile breakage” model for the evolution of the canine karyotype. 
Results

Data

We considered two sets of orthology relationships representing single, nonduplicated and omnipresent orthologous genes in mammalian genomes. The first ortholog data set, which we refer to as D5, contains single orthologs in five species, namely, chicken, mouse, rat, human, and dog. D5 consists of data derived for the chicken genome sequencing project (International Chicken Genome Sequencing Consortium 2004) augmented by 7670 dog genes that have single orthologs in the other four species and that we predicted from the dog genome assembly (see Methods). A second ortholog data set, termed D2, consists of 13,738 single (1:1) orthologous genes in dog and human genomes, and was derived from phylogenetic analyses (L. Goodstadt and C.P. Ponting, in prep.). Data set D2 facilitated chromosomal mapping of quantities between dog and human genomes.

Fourfold degenerate sites (so-called 4D sites) at the third position of codons encoding eight amino acid types were identified as previously (Hardison et al. 2003). GC4D, the fraction of G or C bases at these sites, was calculated for D5 and D2 orthologs, as was GC53, the G+C fraction for 10 kb 5′-upstream and 3′-downstream of transcriptional start and stop sites. The physical distance of a gene to the nearest telomeric end of assembled chromosomal sequence (without spanning the centromere) was assumed to approximate well the true distance to the chromosome's telomere. Similarly, the distance to the centromere was assumed to be the minimum number of bases between the gene and centromere coordinates taken from the UCSC table browser (Karolchik et al. 2004). The neutral rate of nucleotide substitution was assumed to equal Ks, the number of synonymous substitutions per synonymous site, as estimated using codeml from Yang's PAML package (Yang 1997).

Variations in GC4D, GC53, and Ks were noted for all dog (Supplemental Fig. 1) and human (Supplemental Fig. 2) autosomes.
We show these variations for a representative chromosome (Chromosome 1) in each species in Figures 1 and 2.
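The GC4D quantity defined above can be illustrated with a short sketch. It is a simplification under stated assumptions: it scans a single in-frame coding sequence using the standard genetic code, whereas the procedure cited in the text (Hardison et al. 2003) works on aligned orthologs and may apply further conditions. The names `FOURFOLD_PREFIXES` and `gc4d` are hypothetical and not taken from the paper.

```python
# Codon prefixes (first two bases) whose third position is fourfold degenerate
# in the standard genetic code; together they encode eight amino acid types.
FOURFOLD_PREFIXES = {
    "GC",  # Ala
    "GG",  # Gly
    "CC",  # Pro
    "AC",  # Thr
    "GT",  # Val
    "CT",  # Leu (CTN family only)
    "TC",  # Ser (TCN family only)
    "CG",  # Arg (CGN family only)
}


def gc4d(cds):
    """G+C fraction at fourfold degenerate third positions of an in-frame CDS.

    Returns None when the sequence contains no fourfold degenerate site.
    Assumption: a single sequence is scanned; no alignment filtering is done.
    """
    cds = cds.upper()
    gc = total = 0
    # Walk complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in "ACGT":
            total += 1
            gc += codon[2] in "GC"
    return gc / total if total else None
```

For example, `gc4d("GCAGCCATGGGG")` considers the third positions of GCA, GCC, and GGG (ATG has no fourfold degenerate site) and counts two of three as G or C.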
G+C fraction and Ks

As seen for other mammalian genomes (Mouse Genome Sequencing Consortium 2002; Rat Genome Sequencing Project Consortium 2004), the G+C fraction varies greatly across the dog genome (Fig. 1).

G+C fractions are known to correlate significantly between human and murid rodent orthologs (Mouchiroud et al. 1988; Mouse Genome Sequencing Consortium 2002). Among pairs of the five species (data set D5) investigated, we found that GC4D is most correlated between dog and human (Table 1). Surprisingly, it is marginally more correlated between these two species, which diverged ~95 million years ago (Mya) (Springer et al. 2003), than between rat and mouse lineages that share a considerably more recent ancestor (~15 Mya) (Springer et al. 2003).
Median Ks values between these species' pairs reveal that, on average, fewer nucleotide substitutions have accumulated at silent sites in the lineages to dog and human (median Ks = 0.36), than they have to the lineages to mouse and human (median Ks = 0.60) (Mouse Genome Sequencing Consortium 2002), despite the carnivore lineage being an out-group to both primates and rodents. This arises because of the well-known higher substitution rates in murid rodents than in other mammals (Mouse Genome Sequencing Consortium 2002). We further investigated the known positive correlation between neutral rates and G+C fraction (Mouse Genome Sequencing Consortium 2002; Hardison et al. 2003; Hellmann et al. 2003), among the five vertebrates. We find the ranked correlation to be greatest for dog and human orthologs (Table 2); among individual dog chromosomes, the correlation coefficient rises to a value of 0.80 (CFA6 and CFA31). Once more, the correlation is least for mouse and rat orthologs, despite these species sharing a more recent common ancestor.
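The ranked correlations discussed here are Spearman coefficients (ρ). As a minimal sketch of how such a coefficient is obtained from paired per-gene values (for example GC4D in two species, or GC4D against Ks), the following pure-Python code ranks each list with ties averaged and then computes a Pearson correlation on the ranks. The function names are hypothetical, and the paper does not state which software was used for these statistics.

```python
def average_ranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only rank order enters the calculation, a high ρ between dog and human GC4D values indicates that genes keep their relative G+C ordering even if absolute values shift.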
Distance dependencies on G+C fraction and Ks Previously we observed for chicken-human gene alignments (International Chicken Genome Sequencing Consortium 2004) that G+C fractions and Ks values are substantially elevated in regions proximal to the ends of assembled chromosomes (“subtelomeric regions”). In the present study, we find elevations of G+C fractions (Table 3) and of Ks values (Table 4) in the subtelomeric regions of both dog and human chromosomes. In contrast, such elevations are barely perceptible for either rat or mouse chromosomes. Elevation of G+C within human subtelomeres likely accounts for the negative correlation of G+C fraction with chromosomal size (Duret et al. 2002).
Elevation of Ks values in subtelomeric regions is most pronounced in human subtelomeres, and is least in murid rodent subtelomeres (Table 4). Because of these greatly reduced distance dependencies, rodent G+C fractions and Ks data will not be considered further.

These elevations are not maintained uniformly within subtelomeric regions. By plotting median values of GC4D and Ks for bins containing a minimum of 200 genes (data set D2), we observed that each of these quantities declines logarithmically from the ends of assembled dog or human chromosomes (Fig. 3).
Elevations in these quantities, and their declines over physical distance, are substantially more pronounced in human subtelomeres than they are in dog subtelomeres (Fig. 3). Unexpectedly, we also observed strong and significant declines in genes' GC4D and Ks values from dog centromeres, but not from human centromeres (Fig. 3).

G+C fraction and Ks values are elevated in dog subtelomeric regions

From the perspective of the recombination-driven BGC model, the elevations of these quantities in dog subtelomeric and pericentromeric regions were surprising. Each of the dog chromosomes has been formed in the last 60 million years (Myr) from a mosaic of two to four segments from chromosomes of the common ancestor of the carnivores (Nash et al. 2001), and thus these chromosomal ends have only been derived recently by large-scale chromosomal rearrangements. If, owing to recombination-driven BGC, rises in G+C fraction and Ks value occur relatively slowly over time, then it appeared unlikely that sequence has dwelt for sufficient time at the ends of dog chromosomes for this effect to have become so pronounced. If, in contrast, more rapid and substantial changes in base composition within high G+C regions occurred during the past ~100 Myr, then these are not consistent with either the high correlation of dog and human orthologs' GC4D values (ρ = 0.945) (Table 1), or with the findings of others that the human isochore structure is ancestral (Galtier and Mouchiroud 1998; Eyre-Walker and Hurst 2001; Belle et al. 2004). Thus, we do not expect rapid variation in base composition within high G+C regions at subtelomeres and pericentromeres.
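The distance trends reported in this section rest on sorting genes by their distance to a chromosomal end, grouping them into bins of at least 200 genes, and taking the median GC4D or Ks per bin. The sketch below illustrates one way this could be done, together with a least-squares fit of the binned medians against log10 distance to summarize a logarithmic decline. The equal-sized bins, the merging of a short final bin, the fitting procedure, and the function names are all assumptions made for illustration; the text specifies only the minimum bin size and the use of medians.

```python
import math
from statistics import median


def binned_medians(distances, values, min_genes=200):
    """Sort genes by distance; return (median distance, median value) per bin.

    Each bin holds min_genes genes. A final remainder smaller than min_genes
    is merged into the preceding bin so that every bin meets the minimum.
    """
    pairs = sorted(zip(distances, values))
    bins = [pairs[i:i + min_genes] for i in range(0, len(pairs), min_genes)]
    if len(bins) > 1 and len(bins[-1]) < min_genes:
        bins[-2].extend(bins.pop())
    return [(median(d for d, _ in b), median(v for _, v in b)) for b in bins]


def log_fit(points):
    """Least-squares fit of value = a + b * log10(distance); returns (a, b)."""
    xs = [math.log10(d) for d, _ in points]
    ys = [v for _, v in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b
```

A negative slope `b` corresponds to a quantity that falls with increasing distance from the chromosomal end; comparing slopes between species is one way to express that the decline is steeper in human than in dog subtelomeres.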
Significant coincidence of G+C “hotspots” and chromosomal breakpoints

We considered whether these high G+C fraction regions were segregated to subtelomeres and pericentromeres during the fragmentation of the canid karyotype as a direct consequence of chromosomal fission occurring preferentially within such regions (the “fragile breakage” model) (Fig. 4).
Consequently, we defined long-range maxima (hotspots) in either G+C fraction or Ks value, using a sliding window of 10 genes (see Methods); these are two to three orders of magnitude longer (Supplemental Table 1) than previously described recombination hotspots (Jeffreys et al. 2001). As expected from our previous findings (Fig. 3) [...]

However, we were most interested in the current chromosomal locations of dog genes whose single human orthologs have persisted in chromosomal interstitial regions throughout human history, since at least the CAE. In particular, we wished to track the location, in the dog genome, of human G+C fraction and Ks value local maxima within these interstitial regions. Thus, we first delineated 1.048 Gb of human sequence from the interstitial regions (>9 Mb from assembled chromosomal ends) of nine chromosomes, namely, HSA1, 5, 6, 9, 11, 13, 17, 18, and 20, which are known to have escaped major rearrangement (fusion or fission) since the CAE (Wienberg 2004). We refer to the human genes in such regions, and their canine single orthologs, as ancestral interstitial (AI) genes. In the following analysis we discarded all sequence except human and dog regions containing AI genes.

Next, we identified 113 human hotspots containing AI genes that exhibit the highest 20% of GC4D windowed values in each chromosome (see Methods). By mapping these hotspots to canine chromosomes, we found that 24 of these 113 high G+C hotspots have been relocated, during the evolution of the canid karyotype, to 17 subtelomeric regions of dog chromosomes. Using a randomization model (see Methods), we found that this is a higher number than expected by chance (P = 0.037). Thus, assuming that extant dog telomeres derived from fission events, we infer that ancestral regions high in G+C had a significantly greater propensity for fission during the evolution of the canid karyotype.
We also identified 15 human AI-gene-containing regions that are in conserved synteny with a dog pericentromeric region. Of these 15 regions, 12 (80%) are coincident with one or more of the 113 human AI gene GC4D hotspots, which again is unexpected by chance alone (P = 2.7 × 10⁻⁶). Thus, our findings strongly suggest that ancestral interstitial hotspots have preferentially been rearranged to form extant subtelomeric and pericentromeric regions.

Significant coincidence of Ks hotspots and chromosomal breakpoints

We performed a similar analysis on 66 human AI-gene-containing regions characterized by significantly high Ks-value maxima (see Methods). As expected, given the strong correlations between dog and human G+C fractions, and between G+C fraction and Ks value (Table 2), these Ks hotspots also are found preferentially both at dog subtelomeres and at pericentromeres: 14 of 66 high Ks value regions map to 17 dog subtelomeric regions (P = 5.8 × 10⁻⁶) and 9 of 66 high Ks value regions map to 15 dog pericentromeric regions (P = 6.4 × 10⁻⁵). The more significant coincidence of Ks peaks with fragile sites, compared with G+C peaks, may in part be a consequence of the narrower biological variation in GC4D, compared to Ks, combined with fewer sampled sites, which acts to increase the sampling error and to limit G+C peak detection.

G+C fraction and Ks values are elevated in short synteny blocks

The fragile breakage model predicts that short synteny blocks in the dog genome exhibit higher GC4D and Ks values than longer blocks. This follows directly from our observations that chromosomal breakage in the dog lineage occurred preferentially in regions associated with high GC4D and Ks values. We thus investigated whether genes' GC4D and Ks values are correlated with the size of dog synteny blocks in which they are located.
Indeed, we found significant negative correlations between synteny block size and either dog GC4D (Spearman's ρ = -0.217; P-value < 2.2 × 10⁻¹⁶) or Ks value (Spearman's ρ = -0.220; P-value < 2.2 × 10⁻¹⁶). Similar significant correlations (data not shown) were observed for dog GC53, human GC4D, and human GC53. We then investigated whether these quantities increase toward a synteny breakpoint, in a manner similar to that seen for subtelomeric sequence (Fig. 3).
Significant coincidence of ancestral high Ks value regions and high ΔGC4D regions

We next considered a prediction of the recombination-driven BGC model that G+C fraction has increased more within hotspots than elsewhere. For this analysis, we calculated values of ΔGC4D, the GC4D fraction of a dog gene over and above that of its human ortholog. We then identified 84 regions of human AI-gene-containing sequence that exhibited the highest ΔGC4D values (see Methods); human genes in ΔGC4D maxima thus possess significantly lower GC4D values, on average, than their dog orthologs. In agreement with the recombination-driven BGC model, we find that of 32 dog pericentromeric and subtelomeric regions that are in conserved synteny with human AI-gene-containing regions, 18 coincide with one or more of these 84 ΔGC4D maxima; this number is more than expected by chance (P = 5.2 × 10⁻⁴). In contrast, we found no significant coincidence between ΔGC4D minima (representing regions where the human GC4D values, on average, exceed those of their dog orthologs) and dog pericentromeric and subtelomeric regions (P = 0.54).

Similarly, we find that dog GC4D hotspots tend to have increased their G+C content, relative to their human orthologs, when they are located close to chromosomal ends. ΔGC4D is significantly elevated in the 20 Mb approaching dog telomeres (Spearman's ρ = 0.146, P-value < 2.2 × 10⁻¹⁶). However, contrary to the recombination-driven BGC model, ΔGC4D is not correlated (|ρ| < 0.05) with distances to human telomeres and centromeres; it is also not correlated with distance to dog centromeres.

Thus, ancestral interstitial sequence that suffered a breakpoint and rearrangement during canid evolution is not only elevated in extant gene GC4D values, but this GC4D elevation in dog sequence significantly exceeds the elevation of its human orthologs.
Obviously, these regions, which we propose have been substantially elevated in GC4D in the common ancestor of dog and human, either have reduced their GC4D values in the human lineage, or have increased their GC4D values in the canine lineage, or both.

Discussion

Our results demonstrate location-dependent effects on nucleotide composition and substitution rates in both human and dog genomes. We observed that GC4D and Ks values are significantly elevated within human subtelomeric regions, and that these elevations are greater than those seen for dog, mouse, and rat subtelomeric regions. These findings immediately suggest that human sequence, which has been relatively intransigent to chromosomal rearrangement, might have steadily increased its G+C fraction over time within subtelomeric regions, to a greater extent than seen for the subtelomeres of genomes, such as those of dog, mouse, and rat, which have suffered from substantial rearrangements. In this BGC scheme, the stability of the karyotype of human ancestors would have resulted in steadily increasing G+C fractions in subtelomeric regions that possess high rates of recombination (Duret et al. 2002). This is because recombination, when coupled to biased gene conversion, increases the incorporation of G or C bases at a mismatch site (Galtier et al. 2001). Higher G+C proportions have been linked to higher neutral rates mainly because of the hypermutability of the CpG dinucleotide (Cooper and Youssoufian 1988; Sved and Bird 1990), although this link has recently been questioned (Fryxell and Moon 2004). Such a scheme would be consistent with an origin of mammalian high G+C isochores from within stable subtelomeric regions in early chordate genomes.

Importantly, our findings cannot all be explained by this recombination-driven BGC scheme.
In particular, the scheme predicts higher G+C fractions among human subtelomeric genes than their dog orthologs (i.e., a positive correlation between ΔGC4D and distance from telomeric end), because the human karyotype is relatively ancestral whereas that of the dog is derived. Although such a significant positive correlation occurs, it is relatively small (Spearman's ρ = 0.045) and at a similar level to genes approaching human centromeres (ρ = 0.046), where recombination typically is suppressed (Kong et al. 2002). Moreover, this scheme predicts divergent GC4D values for human and dog orthologs that have been subjected to different amounts of recombination. Nevertheless, we observed an extremely high correlation (Spearman's ρ = 0.945) between human and dog orthologs' GC4D values (Table 1). This indicates that a gene's G+C fraction, for these two species, is an ancestral property, and thus has not altered substantially in rank order since their common ancestor ~95 Mya. An extant human subtelomeric gene that exhibits a high G+C fraction is likely to have possessed a relatively high G+C value in the genomes of the earliest boreoeutherian ancestor and its predecessors, regardless of its location in these ancestral chromosomes. This is supported by a previous observation that the average G+C fraction at the third codon positions (GC3) of 41 genes has reduced by only 2.3% for the primate lineage, and 2.1% for the carnivore lineage since the CAE (Belle et al. 2004), and by the similarity between average GC4D values for dog and human genes (0.590 and 0.570, respectively).

If G+C fractions, and thus Ks rates, have not been elevating substantially and progressively within the subtelomeres of chromosomes in the human lineage, then what other evolutionary process might have caused such effects? We suggest an alternative (“fragile breakage”) scheme to account for our findings (Fig. 4).

In support of this scheme, we found that extant dog subtelomeric and pericentromeric sequences arose preferentially by fission of chromosomal sequence that was ancestrally enriched in G+C bases. We identified significant correlations between subtelomeric and pericentromeric regions in the dog genome, and their orthologous regions in human that possess elevated G+C and Ks values, despite these human regions being unlikely to have been located close to telomeres since the CAE. We also note that our proposal is consistent with studies of human chromosomes that show that breakpoints in human chromosomes occur preferentially in telomeres (Yu et al. 1978; Stoll 1980) and in G+C-rich G-light regions (Aula and von Koskull 1976; Nakagome and Chiyo 1976; Stoll 1980; Abeysinghe et al. 2003). The proposal may be valid for other genomes besides those of human and dog because double-stranded breaks are also known to occur predominantly in high G+C regions in the yeast Saccharomyces cerevisiae (Baudat and Nicolas 1997; Gerton et al. 2000). While we cannot formally discount the possibility that such breaks preferentially occur in high G+C regions as a result of a dependency on a third quantity with which G+C and Ks both covary, we believe it possible that nonrandom breakage occurs directly because of nonuniform base composition often acting over megabase scales.

Finally, the fragile breakage model is entirely consistent with significant negative correlations between synteny block size and either GC4D or Ks values. Furthermore, we find significant negative correlations between either G+C content or Ks and distance to a synteny breakpoint (Fig. 5).

Recent findings have overturned the random-breakage model of chromosomal evolution (Nadeau and Taylor 1984). Breakpoints often appear to be clustered, implying “reuse” of breakpoints, in a model termed fragile breakage (Pevzner and Tesler 2003; Bailey et al. 2004).
One indicator of fragility is the occurrence of segmental duplication in orthologous sequence (Bailey et al. 2004). Our findings suggest that fragility is also associated with high G+C fraction. If so, high G+C regions appear not only to be highly susceptible to nucleotide substitution, insertions, deletions, and recombination (Hardison et al. 2003; Taylor et al. 2004), but also to chromosomal breakage.

Our results do not counter the hypothesis of G+C elevation at subtelomeric regions due to increased recombination and biased gene conversion, although we observed scant evidence of such elevation within human subtelomeres. Rather, they highlight correspondences between G+C and Ks hotspots and the evolution of canid chromosomes, and represent consequences of mutational processes that have shaped the canid, and perhaps other, karyotypes. We have shown that G+C hotspots in the common ancestor of dog and human either have reduced their GC4D values in the human lineage, or have increased their GC4D values in the canine lineage, or both. Further studies of the G+C changes in the mammalian lineage are likely to reveal the relative contributions of these two evolutionary processes to G+C fraction elevations within subtelomeres, pericentromeres, and fragile breakpoints.

Methods

Gene sets

We used two gene sets for our analyses. The first (denoted D5) was a set of 6800 1:1:1:1:1 chicken:human:mouse:rat:dog orthologs. This consisted of a set of 8164 1:1:1:1 chicken:human:mouse:rat orthologs, as described previously (International Chicken Genome Sequencing Consortium 2004), augmented with their predicted single dog orthologs. The four-way International Chicken Genome Sequencing Consortium set consists of Ensembl genes (Hubbard et al. 2005) based on the Homo sapiens NCBI34, Mus musculus NCBI30, Rattus norvegicus Baylor v2.1, and Gallus gallus (WUSTL Feb. 2004 release) assemblies.
Dog orthologs were predicted by first aligning all transcripts between each human and mouse 1:1 ortholog pair using BLAST (Altschul et al. 1997). Next, the human transcript that aligned with the highest bit-score density (bit-score per aligned length) was used to query the dog genome (Broad v1), initially with Exonerate (Slater and Birney 2005), and refined subsequently with GeneWise (Birney et al. 2004). In all, 17,598 predictions representing 8066 queries were returned. Where predictions overlapped by >20% of their length, the prediction with the highest GeneWise score was retained; 12,167 predictions representing 7988 queries remained. The Ks value between each top hit and the initial human query was calculated with Codeml from the PAML package (Yang 1997; http://abacus.gene.ucl.ac.uk/software/paml.html), essentially as previously (Mouse Genome Sequencing Consortium 2002) (two genes were removed at this stage as a result of alignment problems). Likely dog processed pseudogenes were identified as intron-less predictions where the human template possesses at least one intron 10 or more codons from either translational end; subsequently, these were removed. The highest scoring transcript for each of the remaining 7751 genes was then added to the four-way International Chicken Genome Sequencing Consortium set to form the 1:1:1:1:1 D5 orthology set. In all but 15 of these dog predicted orthologs, these proved to represent reciprocal-best-BLASTp-hits to their predicted human orthologs' sequences. The median human-dog orthologs' Ks value for the set was 0.36, and the median amino acid percentage identity was 92.75%; these values are similar to those obtained by others (Lindblad-Toh et al. 2005). Where positional information was required for all genes within the set, 951 orthologous quintuplets containing one or more genes located on an unassembled chromosome were removed. A second ortholog set (denoted D2) was obtained from phylogeny-based orthology predictions by L. 
Goodstadt and C.P. Ponting (in prep.) between Ensembl genes based on the H. sapiens NCBI35 and C. familiaris (Broad v1) genomes. The D2 set consisted of 13,747 orthologous gene pairs. This number is reduced to 11,713 pairs where placement on an assembled chromosome for both orthologs is required. As above, all Ks values were estimated using Codeml (Yang 1997). The median human-dog orthologs' Ks value for the D2 set was 0.372, the median Ka/Ks ratio was 0.107, and the median amino acid percentage identity was 89.4%.

Maxima (hotspots) determination

G+C nucleotide fractions were calculated both at fourfold degenerate (4D) sites (GC4D) and at sites 10 kb upstream and downstream of the transcriptional start site (GC53). In order to identify regional maxima in G+C fraction values, a sliding window containing 10 D2 genes was translated across each assembled dog and human autosome. Variations in quantities were examined for autosomes only, because mammalian X-chromosomes have persisted without fusion or fission at least since the CAE (Kohn et al. 2004). Within each window the median GC4D, the median Ks value between dog and human orthologs, the GC4D values of their orthologs, and the median difference between the genes' GC4Ds were recorded, along with the window's location, defined as the mean position of its genes' midpoints between transcriptional start and end bases.

Ks and ΔGC4D values' maxima (hotspots) were defined using similar procedures. First, for each chromosome the mean Ks or ΔGC4D value averaged over all gene windows was calculated. An initial set of hotspots was then defined as the locations of gene windows whose median Ks or ΔGC4D values were 2.0 standard deviations greater than the chromosomal mean; for normal distributions this threshold delineates the highest 2.3% of the data. Results (data not shown) obtained at higher thresholds, namely, 2.5 and 3.0 standard deviations, were similar to those described here.
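The windowing and thresholding steps described above can be illustrated with a minimal Python sketch. It is not the code used in this study. The function names and the example values are hypothetical, and it assumes that both the chromosomal mean and the standard deviation are taken over the window medians of a single chromosome, using the population standard deviation. The last function checks the statement that a 2.0 standard deviation threshold delineates roughly the highest 2.3% of a normal distribution.

```python
from statistics import NormalDist, mean, median, pstdev


def window_medians(values, size=10):
    """Median of every run of `size` consecutive genes, ordered along a chromosome."""
    return [median(values[i:i + size]) for i in range(len(values) - size + 1)]


def initial_hotspots(values, size=10, n_sd=2.0):
    """Indices of windows whose median lies more than n_sd SDs above the mean.

    Assumption: the mean and SD are both computed over the window medians of
    one chromosome, and the population SD is used.
    """
    medians = window_medians(values, size)
    threshold = mean(medians) + n_sd * pstdev(medians)
    return [i for i, m in enumerate(medians) if m > threshold]


def upper_tail_fraction(n_sd):
    """Fraction of a normal distribution lying more than n_sd SDs above its mean."""
    return 1.0 - NormalDist().cdf(n_sd)


# Hypothetical per-gene Ks values for one chromosome: a uniform background
# with a cluster of ten fast-evolving genes in the middle.
ks = [0.3] * 30 + [0.9] * 10 + [0.3] * 30
print(initial_hotspots(ks))       # window start indices flagged as hotspots
print(upper_tail_fraction(2.0))   # close to 0.023
```

In this toy input only the windows dominated by the ten high-Ks genes exceed the threshold. The same functions could be applied to ΔGC4D values, since the text states that both quantities were treated with similar procedures.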
The resulting maxima were then aggregated by requiring at least three consecutive windows whose values all lay above threshold, and adjacent bins were amalgamated to form broader hotspots. An additional set of more localized peaks was defined by repeating this procedure using a sliding 20-Mb window across each chromosome. The resulting hotspots from both procedures were then combined. Summary statistics for the size distributions of these peaks are provided in Supplemental Table 1. The larger localized variance (see Figs. Figs.11

Conserved synteny

Conserved synteny was defined using dog versus human 500-kb synteny maps obtained from the Dog Genome Sequencing Consortium (Lindblad-Toh et al. 2005). Human subtelomeres were defined from the relevant genome assembly as regions within 5 Mb of the end of each autosome sequence. For each of the five human acrocentric chromosomes, only one subtelomeric region was defined. Regions within the dog genome exhibiting conserved synteny to human subtelomeres (<5 Mb from the assembled telomeric end) were also recorded. For the dog genome, pericentromeric regions were defined as regions <5 Mb from the proximal end of the autosomes, while subtelomeric regions were defined as regions <5 Mb from the distal ends of autosomes. The position of each human genomic region in conserved synteny to a dog pericentromeric or subtelomeric region was recorded. Interstitial regions were defined as sequence >9 Mb from both assembled telomeric and centromeric ends.

Significance of spatial coincidences

The likelihood of observing at least a given number of overlaps between a pair of data sets was evaluated using randomized simulations and Z-scores. For each of the two data sets, a set of identically sized nonoverlapping fragments was drawn from an identical sample space, and the number of overlaps between the two randomized sets counted.
This procedure was repeated 10,000 times for each test, and the distributions of randomized overlap frequencies checked for signs of kurtosis. A Z-score was then derived for the observed number of genomic overlaps, from which a normalized probability was calculated.

Additional statistical analysis

Statistical analysis not described above was performed using R (http://cran.r-project.org/).
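The randomization test for spatial coincidences can be sketched as follows. This is an illustration under stated assumptions rather than the procedure as implemented for this study. It treats the sample space as a single linear interval, counts an overlap as a fragment of the first set that intersects at least one fragment of the second set, and uses a one-sided normal tail for the probability. All function names and example coordinates are hypothetical, and the kurtosis check mentioned in the text is omitted.

```python
import random
from statistics import NormalDist, mean, pstdev


def place_fragments(sizes, space, rng):
    """Place fragments of the given sizes at random, without overlap, in [0, space)."""
    free = space - sum(sizes)
    if free < 0:
        raise ValueError("fragments do not fit in the sample space")
    order = list(sizes)
    rng.shuffle(order)
    # Sorted random offsets into the free space keep the fragments disjoint.
    gaps = sorted(rng.randint(0, free) for _ in order)
    placed, used = [], 0
    for gap, size in zip(gaps, order):
        start = gap + used
        placed.append((start, start + size))
        used += size
    return placed


def count_overlaps(a, b):
    """Number of half-open intervals in `a` that intersect any interval in `b`."""
    return sum(any(s1 < e2 and s2 < e1 for s2, e2 in b) for s1, e1 in a)


def overlap_z_score(a, b, space, n_iter=10_000, seed=0):
    """Observed overlap count, its Z-score against randomized sets, and a p-value."""
    rng = random.Random(seed)
    sizes_a = [e - s for s, e in a]
    sizes_b = [e - s for s, e in b]
    null = [
        count_overlaps(place_fragments(sizes_a, space, rng),
                       place_fragments(sizes_b, space, rng))
        for _ in range(n_iter)
    ]
    observed = count_overlaps(a, b)
    z = (observed - mean(null)) / pstdev(null)
    return observed, z, 1.0 - NormalDist().cdf(z)


# Hypothetical example: two identical sets of ten 5-unit regions in a
# 1000-unit sample space, so every region coincides with a partner.
regions = [(i * 100, i * 100 + 5) for i in range(10)]
observed, z, p = overlap_z_score(regions, regions, space=1000, n_iter=1000)
print(observed, z, p)
```

Drawing sorted offsets into the free space and then shifting each fragment by the summed lengths of those placed before it guarantees nonoverlapping placements without rejection sampling. Other sampling schemes would be equally consistent with the description in the text.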
Acknowledgments

We are indebted to Ensembl, Kerstin Lindblad-Toh, Tarjei Mikkelsen, and others of the Dog Genome Sequencing Consortium, for assistance, and to Leo Goodstadt and Andreas Heger for helpful discussions. We thank all the reviewers, whose comments were invaluable in improving this manuscript.

Notes

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3896805.

Footnotes

[Supplemental material is available online at www.genome.org.]