![]() | ![]() |
Formats:
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright : © 2006 Goodstadt and Ponting. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Phylogenetic Reconstruction of Orthology, Paralogy, and Conserved Synteny for Dog and Human Medical Research Council Functional Genetics Unit, University of Oxford, Department of Physiology, Anatomy, and Genetics, Oxford, United Kingdom George Weinstock, Editor Baylor College of Medicine, United States of America * To whom correspondence should be addressed. E-mail: leo.goodstadt/at/anat.ox.ac.uk Received March 9, 2006; Accepted August 21, 2006. This article has been cited by other articles in PMC.Abstract Accurate predictions of orthology and paralogy relationships are necessary to infer human molecular function from experiments in model organisms. Previous genome-scale approaches to predicting these relationships have been limited by their use of protein similarity and their failure to take into account multiple splicing events and gene prediction errors. We have developed PhyOP, a new phylogenetic orthology prediction pipeline based on synonymous rate estimates, which accurately predicts orthology and paralogy relationships for transcripts, genes, exons, or genomic segments between closely related genomes. We were able to identify orthologue relationships to human genes for 93% of all dog genes from Ensembl. Among 1:1 orthologues, the alignments covered a median of 97.4% of protein sequences, and 92% of orthologues shared essentially identical gene structures. PhyOP accurately recapitulated genomic maps of conserved synteny. Benchmarking against predictions from Ensembl and Inparanoid showed that PhyOP is more accurate, especially in its predictions of paralogy. Nearly half (46%) of PhyOP paralogy predictions are unique. Using PhyOP to investigate orthologues and paralogues in the human and dog genomes, we found that the human assembly contains 3-fold more gene duplications than the dog. Species-specific duplicate genes, or “in-paralogues,” are generally shorter and have fewer exons than 1:1 orthologues, which is consistent with selective constraints and mutation biases based on the sizes of duplicated genes. In-paralogues have experienced elevated amino acid and synonymous nucleotide substitution rates. Duplicates possess similar biological functions for either the dog or human lineages. Having accounted for 2,954 likely pseudogenes and gene fragments, and after separating 346 erroneously merged genes, we estimated that the human genome encodes a minimum of 19,700 protein-coding genes, similar to the gene count of nematode worms. PhyOP is a fast and robust approach to orthology prediction that will be applicable to whole genomes from multiple closely related species. PhyOP will be particularly useful in predicting orthology for mammalian genomes that have been incompletely sequenced, and for large families of rapidly duplicating genes. Synopsis Biologists often exploit the evolutionary relationships between proteins in order to explain how their findings are relevant to the biology of other species, including Homo sapiens. The most natural way to define these relationships is to draw family trees showing, for example, which human protein is the counterpart (“orthologue”) of a protein in dog, and which human proteins have arisen by recent duplication of existing genes (“paralogues”). On a small-scale this is relatively straightforward, but it is difficult to do this automatically on a genome-wide scale. In this paper the authors describe a new approach to drawing a giant family tree of all proteins from humans and dogs. They show how this tree allows them to refine some protein predictions and discard others that are likely to be nonfunctional dead sequences. Family relationships can show how the dog and human genomes have been rearranged since their last common ancestor. In addition, they help to identify the proteins that are specific to either dog or human, and which contribute to these species' biological differences. Giant trees, drawn from this method, will help to associate the differences, duplications, and evolution of proteins in different mammals with their distinctive physiologies and behaviours. Introduction Distinguishing orthologues (genes that arose via a speciation event) from paralogues (genes that arose via duplication within a genome) is critical to comparative biology. This is because orthology is the basis by which molecular function in humans can best be inferred from experimental results in model organisms. Orthologous genes are descended from a single gene in the last common ancestor of their two species [1]. They are hence most likely to share a conserved ancestral gene function. Genes may be duplicated to give rise to multiple additional copies, often lying in tandem. Lineage-specific duplicates (“in-paralogues” [2]) from two species together form an “orthologous clade” whose members are all descended from a single ancestral gene in the last common ancestral species (Figure 1
Traditionally, orthology relationships for individual gene families have been predicted by carefully constructed multiple alignments and by reconstructing phylogeny via the use of either maximum likelihood [8] or parsimony [9] methods. However, for genome-scale investigations, current methods do not yet automatically generate multiple alignments of unfailing quality, especially in the face of variable genomic data and gene prediction quality, rendering subsequent phylogenetic steps unreliable. Instead, orthology across whole genomes has been determined automatically using reciprocal best hits in all-against-all comparisons of amino acid sequences [2,6,7,10]. Two sequences are identified as orthologous if they find each other as the highest scoring alignments among all sequences from the other species. This procedure is most reliable for relatively flawless datasets, such as those from prokaryotes, but works less well where gene sets are incomplete and the predictions imperfect. In vertebrate eukaryotic species, the gene sets used for orthology prediction are much more likely to contain errors simply because the challenges for gene prediction are so much greater. Vertebrate genomes employ long introns, alternative splicing, and cryptic splice sites, and are more likely to contain sequencing errors, such as base changes and insertion–deletions, or assembly errors causing inversions and translocations. As a result, exons and transcripts may be absent from predicted genes, and more rarely, pseudogenes may be predicted wrongly as genes. When duplications have occurred in one or both lineages, the resulting orthologous genes are in one-to-many or many-to-many relationships (Figure 1 Ensembl [11] and Inparanoid [2] are widely used methods for predicting orthology and paralogy. Both approaches start with reciprocal best-hitting protein sequence pairs. Each method assumes that genes are best represented by the longest transcript, and no other splice variants are considered. For wholly sequenced, closely related genomes, Ensembl takes advantage of the observation that most in-paralogues are generated by tandem duplications. This tends to preserve gene order. Accordingly, if there is a series of orthologues defined by BLASTP [12] reciprocal best-hitting pairs that have the same relative gene order in both species and which fall within a tuneable genomic window size (e.g., 1.5 Mbp [13]), and if protein sequences from the intervening genes have high scoring BLAST hits to the initial orthologues, then these too will be gathered into an orthologue set [14]. Because this is not a reciprocal operation, some of the resulting orthologues between two species inevitably exhibit contradictory, nontransitive relationships (Figure 1 This Ensembl process does not correctly distinguish, in many cases, between “in-paralogues” (lineage-specific duplicates) and “out-paralogues” (duplicated genes present in the common ancestor of the two species). Where there have been lineage-specific gene losses or failures in gene prediction, then the corresponding gene in the other species should be identified as an “orphan” (an unpaired gene). Instead, Ensembl may assign such orphaned genes wrongly as members of a neighbouring orthologue family, even if they are distant homologues (Figure 1 Inparanoid employs stricter criteria for assigning paralogues than Ensembl [2]. Again, the main orthologue pair of each set of orthologues is first identified as the reciprocal best pairwise match. Unlike Ensembl, additional orthologues are then added to this set only if their proteins are more sequence-similar to the initial orthologue from the same species. By design, and partly for reasons of computational cost, Inparanoid examines only pairwise relationships and thus does not construct phylogenies. Instead, the method has a careful set of heuristics to merge, delete, or separate predicted orthologue sets with overlaps. These heuristics can only be readily understood given an implicit underlying phylogenetic model. Both Inparanoid and Ensembl labour under three limitations. First, orthology and paralogy relationships are properties of the evolutionary history of a gene family [1], barring partial gene conversions. Phylogenetic trees are thus the most natural way to represent the familial relationships among homologues. Relying solely on pairwise relationships fails to make optimal use of all available information. Second, underlying assumptions for both methods are that protein similarity accurately reflects evolutionary distance and that paralogues evolve at equal rates [2]. For recently duplicated genes, many of which are or have been evolving adaptively, the reliance on uniform selection upon all members of a gene family is particularly treacherous. Even in the general case, the rate of amino acid substitution varies by up to 300-fold [16]. As a result, orthologue sets may contain a disproportionate number of large families that have ancient divergences but are highly conserved. Rapidly evolving genes with recent provenances will be underrepresented. Third, Ensembl and Inparanoid make no explicit provision for handling genes with multiple transcripts. Both describe orthology and paralogy in terms of genes, and yet assign orthology not directly from gene comparisons but indirectly via protein sequence comparisons. However, where there are alternatively spliced variants, there is no obvious way to chose between all the possible sequence comparisons involving different variants. Ensembl, and analyses using Inparanoid, skirt around this problem by discarding all but the longest transcripts. However, there is no guarantee that the longest transcripts from orthologous genes are themselves orthologous throughout because they may each employ different combinations of exons. We consequently sought a new approach to predicting orthology and paralogy relationships which: (1) would be applicable to large-scale analyses of multiple entire genomes; (2) directly produces phylogeny; (3) would be less susceptible to variations in evolutionary rates; (4) handles multiple transcripts explicitly; and (5) would not rely on synteny information. PhyOP (phylogenetic orthology and paralogy) has been designed to meet all of these requirements. Unlike Inparanoid and Ensembl, PhyOP explicitly reconstructs phylogenies of transcripts to take advantage of all available sequence data. Gene orthology predictions are made by comparing the transcript phylogeny with the known species tree. PhyOP predicts orthology using a distance metric based not on amino acid substitutions, as in all other approaches to large-scale orthology prediction, but rather on dS, the number of synonymous nucleotide substitutions per synonymous site. Because silent mutations in coding DNA sequences do not lead to changes in the protein products, synonymous sites are under fewer evolutionary constraints than other coding sites [17], and hence more accurately reflect underlying neutral rates and the true evolutionary distance between genes [18]. dS values vary only by approximately 2-fold or 3-fold within mammalian genomes [6,19,20], which is two orders of magnitude lower than variations in the amino acid substitution rate [16]. Over long evolutionary distances, however, the method becomes increasingly less appropriate because of saturation at synonymous sites. Nevertheless, aggregate dS estimations have been employed even for relatively divergent species pairs, such as human and chicken, which are separated by more than 300 million years [4]. Our approach also differs from methods that rely on conserved gene order in inferring orthology. The use of PhyOP is thus appropriate both for relatively complete and for incomplete genomes such as those assembled only into short contigs. Having achieved this aim, we realised that we could exploit conserved gene order information so as to determine the method's efficacy. We use, as our basis dataset, Ensembl [21] genes for the newly sequenced dog (Canis familiaris) genome sequence together with the corresponding set for human (Homo sapiens). This provided an opportunity to compare the degrees of lineage-specific gene duplication in dog and human, and to consider the fraction of single orthologues that have persisted in both lineages, without apparent loss or duplication, since their common ancestor. Dog and human orthologues predicted by PhyOP can be downloaded from http://wwwfgu.anat.ox.ac.uk:8080/phyop_orthologs, and the software implementation is freely available from the authors. Results PhyOP is a phylogenetic method that uses the synonymous substitution rate dS as a proxy for the evolutionary distance. Currently, it is only suitable for the accurate prediction of orthology among species of recent divergence, such as the mammals. To recover gene phylogeny and to predict orthology, the design of PhyOP had to overcome five challenges. 1) Like other metrics relying on nucleotide sequence, dS saturates over large evolutionary distances and cannot be used to distinguish gene duplications that are more ancient than the divergence of the first mammals. Phylogenetic algorithms were required to be modified in order to disregard saturated values at large evolutionary distances (see Materials and Methods). 2) Inconsistent gene phylogenies due to missing or erroneous transcript predictions had to be resolved. 3) We also had to resolve apparent cases of merged genes in the Ensembl gene set, where adjacent and separate paralogues have apparently been amalgamated erroneously into a single prediction. 4) Pseudogenes present in the Ensembl gene set had to be distinguished from functional genes. Missing or inserted bases in high-coverage genome sequence are rare (estimated to be fewer than one in 104: see, for example, [22–24]). Yet even these would interrupt the reading frames of approximately 4% of protein coding genes (assuming 350 codons per gene). Apparent disruptions in otherwise functional genes can also result from missing exons and other problems in predicting exon boundaries, as well as from insertion–deletion single nucleotide polymorphisms. Ensembl attempts to rescue these functional genes by introducing short (≤ 10 bp) artificial introns around disruptions to the reading frame. However, as it is impossible to know a priori which disruptions are artefactual, the Ensembl procedure carries the necessary cost of mispredicting some nonfunctional pseudogenes with real in-frame disruptions as functional genes. 5) Optimal reconstruction of gene orthology and paralogy required phylogenies derived from all, rather than a subset of, transcripts. Transcript Phylogeny Other orthology prediction methods assume that the longest transcripts of orthologous genes would themselves be orthologous. Instead of making such an a priori choice of transcripts, we realised that a phylogeny of all available transcripts from both species would necessarily disambiguate transcripts that are orthologous from those that are paralogous. This could then be used as a secure basis from which to infer phylogeny on the gene level. Details of the derivation of transcript phylogeny by PhyOP are described in Materials and Methods and are illustrated as flow diagrams in Figure 2
Orthologue Predictions Using the phylogeny of all transcripts, PhyOP was able to predict 14,807 dog genes in 1:1 orthologue relationships with human genes (Table 1). This involved 87% of all predicted dog genes. Together with dog orthologues in “many” relationships, this method predicts orthology for 93% of genes in the dog genome assembly. This is approximately twice the proportion previously predicted for rat and human genes [7]. These numbers exclude Ensembl gene predictions that are likely to be nonfunctional pseudogenes (see below).
Orthologues from Consistent Phylogenies In the vast majority of cases, even though orthologous genes may have multiple splice variants, only one transcript for each gene was found to be in an orthologous relationship. This allowed the orthologous relationships between genes to be inferred straightforwardly and with confidence. This was the case for 14,896 dog genes and 15,417 human genes. These include 465 dog and 1,286 human genes that were involved in lineage-specific duplications. In only a surprisingly small number of cases (198 dog and 198 human genes) did genes possess multiple transcripts that were all in consistent orthologous relationships. These orthologues each have an identical number of transcripts, all in orthologous relationships (Figure 3
Orthologues from Inconsistent Phylogenies 720 dog and 859 human genes were predicted in orthology relationships following the resolution of inconsistent transcript phylogenies. These are cases where different combinations of orthology relationships between genes are suggested by different splice variants. In some instances, inconsistencies were due to missing transcripts (Figure 3 Separating Merged Genes We found 388 dog and 322 human gene predictions that appear to have been erroneously merged with neighbouring paralogous genes, although a small minority of these might represent chimeric gene fusions [27–29]. Such instances were evident among genes whose transcripts were placed in inconsistent phylogenies (Figure 4
We disentangled merged genes systematically as part of the orthology prediction method. Proper resolution of the transcript phylogeny exploited the observation that transcripts derived from merged genes are chimeric: they possess both orthologous and paralogous regions with respect to transcripts from the other species, and thus tend to exhibit elevated dS values and hence long branches. Most such problematic transcripts are, in fact, automatically rejected as “orphans” (i.e., those not in orthologous relationships with any other transcript) by our procedure. Using the previously described criteria, we selected a representative transcript while simultaneously discarding all other transcripts from the same gene with which it overlaps on the genome (see Figure 4 Pseudogenes An initial survey of predicted in-paralogues indicated significant contamination with processed pseudogenes. These are widely dispersed, intron-less, or disrupted copies of known multi-exonic genes, and are due to the retrotransposition of mature RNAs. Since there are an estimated 19,000 pseudogenes in the human genome [30], it is unsurprising that some of these should appear among the predicted gene set. Homologues of highly expressed ribosomal and RNA- and DNA-binding proteins are especially numerous among paralogous retrogenes. These have previously been shown to be overrepresented among pseudogenes [30–32], presumably because of the high expression of such genes in germline cells. As a result, predicted in-paralogues are more likely to have reading frame disruptions and single exons, and be located far from conserved syntenic regions (Table 2).
We used the following heuristic to filter out these retrogenes. All single-exonic or disrupted genes found outside syntenic blocks were discarded. Genes with multiple disruptions were also discarded as nonfunctional. In addition, for large orthologue sets with widely scattered members (on more than four chromosomes), we identified the orthologues (at least one from each species) most likely to represent true genes (using the criterion of three or more exons with matching exon boundaries in both species) and excluded all other orthologues with few (less than three) and nonmatching exons. Altogether, we used these criteria to identify 1,108 dog and 1,107 human candidate pseudogenes (Table 1) that, as expected, show considerable relaxation of selective constraint (Figure 5
Removal of these putative pseudogenes also left orphaned 88 dog and 349 human genes. Our lists of pseudogenes necessarily include functional genes that have multiple apparent disruptions due to sequencing or gene prediction errors. Since these are far more common in the incomplete dog genome assembly, there should be more true dog genes erroneously predicted as pseudogenes, and also more human than dog orthologues orphaned by pseudogenes. Quality of Orthologues Several independent measures show that orthology predictions produced by PhyOP are of high quality. Protein sequences corresponding to the representative transcripts of these 1:1 orthologues are aligned essentially throughout their entire lengths, and 92% have well-conserved exon boundaries (see Materials and Methods). Careful manual examination of selected genes suggests that most of the remaining discrepancies either derive from our conservative approach in comparing exon structure (some real changes in exon lengths have occurred since the human–dog divergence) or are due to errors in the prediction of gene structure, such as missing exons or extra introns. It should be noted that gene predictions for the dog genome are more challenging given both the paucity of dog mRNA data and the draft quality of the canine genome assembly. In-Paralogues The PhyOP pipeline predicted 540 dog and 1,548 human in-paralogues, representing 329 dog and 988 human duplication events (Table 1). Human gene duplications appear to have been fixed at a rate three times higher than in the dog lineage (see Discussion). In-paralogues have significantly lower percentages of identity (median of 78.1% versus 91.8% in 1:1 orthologues) and higher dN/dS ratios (median of 0.23; Table 4 and Figure 5
In-Paralogues Tend To Be Shorter than 1:1 Orthologues Most duplicated genes exhibit full-length alignments with their orthologues (median, 91%). Transcripts of in-paralogues, however, tended to be considerably shorter than those in 1:1 relationships, to encode shorter peptides, and to comprise fewer exons (medians of two and four for dog and human in-paralogues, compared with medians of seven and eight for dog and human 1:1 orthologues, respectively; Table 6). In-paralogues were also more likely to be single-exonic, which appears from close inspection of individual cases (including, for example, olfactory receptor and α-interferon genes) not to be due to contamination with large numbers of likely pseudogenes. Compared with 1:1 orthologues, in-paralogues are considerably less likely to possess conserved exon boundaries (67% versus 92% in 1:1 orthologues), perhaps reflecting the greater challenge in predicting adjacent sequence-similar paralogues.
The evidence thus suggests that there is a mutational bias towards shorter in-paralogues. Duplicated genomic segments in many species tend to have an L-shaped distribution curve biased towards short regions [39,40], and a majority of segmental duplications in the human-lineage are smaller than the median gene size ([40] and Table 6). Longer genes may be less likely to be duplicated in their entirety with promoter and multiple exons intact and may be more likely, instead, to give rise to nonfunctional gene fragments. Orphaned Genes without Predicted Orthologues A minority of genes (1,189 dog and 4,273 human genes) did not possess any transcripts in orthologous relationships and were classified by PhyOP as “orphans.” These are genes in which the corresponding copy has either been lost or has failed to be predicted in the other lineage. Many of these orphans may not represent functional protein-coding genes. Instead, they include chimeric transcripts or even non–protein-coding sequences as a result of assembly or gene prediction errors. cDNAs generated by high-throughput projects are occasionally incomplete, consisting only of the untranslated regions. This leads to spurious open reading frames being called within the untranslated region and submitted to protein databases as genuine coding transcripts (Ewan Birney, personal communication). These various types of defective genes would all tend to have increased dS, suggesting an ancient divergence from any other partially homologous sequence. This would be consistent with the large proportion of orphan genes that are single-exonic (24% and 43% out of dog and human orphans, respectively, versus 6% of 1:1 orthologues) and the overrepresentation of genes with multiple apparent frame disruptions in the dog genome (6% of orphans versus 3% of 1:1 orthologues). Other orphans, especially in the dog genome, appear to represent genes that have been predicted only as multiple fragments (e.g., the dog gene fragments corresponding to the human titin [TTN] gene). Few large genomic regions in either species were without orthologues, confirming the general high quality both of the dog assembly (CanFam 1) used for the gene build, and of Ensembl's predicted gene set. There were only two regions in the human genome that contained 20 or more orphaned genes in the dog (containing 22 and 24 genes, respectively). The largest number of consecutive dog genes without a predicted human orthologue was only 11. (This is despite the many human genes [1,766 ] without close dog homologues, using a BLAST upper threshold of <10−5.) Estimating the Human Gene Count Our procedure for distinguishing pseudogenes and our discovery of apparently chimeric merged gene predictions necessitates a readjustment of the number of functional protein coding genes as identified by the Ensembl gene prediction pipeline. With a starting human gene set of 22,212, adding 164 previously merged genes and removing 1,107 pseudogenes gives a revised gene count of 21,269. However, many of the orphans are likely also to be nonfunctional, as discussed above. The number of fragmentary and nonfunctional genes among orphans can be estimated simply from the excess of single-exon orphaned gene predictions compared with 1:1 orthologues (Table 2): (42.6% − 6.4% = 36.2%) of 4,273 orphans ≈ 1,550 pseudogenes. This provides an estimate of 19,700 functional human protein coding genes predicted by Ensembl. This rough estimate assumes that all the putative retrotransposed genes we have identified are nonfunctional and that, conversely, most of the orthologues we predict represent real genes and not pseudogenes. Nevertheless, this number is similar to the predicted protein-coding gene count of nematode worms [14] and a protein-coding human gene count estimated using an independent comparative approach (19,400: Michele Clamp, personal communication). Orthologous Chromosomal Segments The high coverage and accuracy of PhyOP allowed us to create a gene-based map of conserved synteny between dog and human genome assemblies (Figure 7
By analogy with whole-genome alignment methods [6], we defined a micro-syntenic segment to be a chromosomal region from one species that contains genes whose orthologues occur in the same order and transcriptional orientation on a single chromosome of the other species. A macro-synteny block comprises one or more micro-syntenic segments that are contiguous in both species but which might be rearranged in order or in orientation. Mapping the dog and human PhyOP orthologues to their genome assemblies revealed 178 dog and 192 human macro-synteny blocks. Half of all orthologues reside in macro-syntenic blocks of 145 and 167 genes or larger in the dog and human genomes, respectively. Gene order is, in the main, highly conserved across the dog and human genomes since few genes reside in small blocks: <1.2% can be found in small blocks containing fewer than ten orthologues. In particular, dog chromosomes CFA29, CFA30, CFA32, CFA36, and CFAX are all completely syntenic to regions of HSA8, HSA4, HSA6, HSA2, and HSAX, respectively; and CFA12, CFA21, CFA24, CFA28, CFA33, CFA35, and CFA38 are orthologous to regions of HSA6, HSA11, HSA20, HSA10, HSA3, HSA6, and HSA1, respectively (Figure 7 Collinear gene order is conserved at larger distances, including over entire lengths of the X chromosomes (Figure 8
In-paralogues are much more likely to be found in smaller micro-syntenic blocks, probably because both gene duplications and chromosomal rearrangements are correlated with the rate of chromosome breakage [46]. The corresponding NOrth,50 values for such dog and human genes are only 25 and 15 (Table 7).
Benchmarking PhyOP with Ensembl and Inparanoid Methods We compared the dS-based orthologue predictions by PhyOP to two other sets predicted on the basis of protein similarity: the first set from Ensembl's Compara database [11], and the second predicted using Inparanoid [2]. Initial orthologue sets for both Ensembl and Inparanoid are founded on protein sequences which are the reciprocal BLASTP [12] best matches of each other. These are described by Ensembl as UBRH or MBRH, for unique or multiple best reciprocal hits [13]. Ensembl also incorporates additional nonreciprocal best matches (RHS, or reciprocal hit based on synteny information, in Ensembl nomenclature) if both genes are less than 1.5 Mb away from a pair of BLAST reciprocal best matches. As described in the Introduction, RHS genes are not derived from a reciprocal procedure and many exhibit nontransitive and conflicting relationships (Figure 1 The three methods, PhyOP, Ensembl, and Inparanoid, predicted similar numbers of 1:1 orthologues (14,807, 13,917, and 14,047). The three sets of predictions largely overlapped, with 12,778 common to all three methods (Figure 9
PhyOP In-Paralogues Are Very Different from Ensembl and Inparanoid Predictions However, orthology predictions where duplications have occurred in the dog or human lineages (i.e., those in one-to-many, many-to-one, or many-to-many relationships) differ significantly among the three methods. PhyOP predicts 2,469 such relationships, compared with 3,247 for Ensembl and only 832 for Inparanoid. The majority (88%) of Inparanoid orthologues in “many” relationships are a subset of those from Ensembl, but PhyOP predictions are strikingly different from either (Figure 9 The orthologues in “many” relationships predicted by the three methods had similar values for protein sequence coverage and percentage identity (Table 4), but Inparanoid alignments were noticeably shorter (median lengths of 207 residues versus medians of 291 and 275 for PhyOP and Ensembl). The dS distribution curves for Ensembl and Inparanoid “many” orthologues were greatly shifted to higher values (dashed lines in Figure 6 Conservation of Gene Order among PhyOP, Ensembl, and Inparanoid Predictions If most genes are duplicated in local tandem copies, and if the rate of genomic rearrangement is low relative to that for gene duplication, then most orthologues would tend to be conserved in gene order. Consequently, we sought to use conserved synteny as a useful benchmark for determining the reliability of each method. We used the size of the micro-synteny segments (those with conserved gene order and transcriptional orientation) as a measure of conservation of ancestral gene order. We found that PhyOP orthologues are more likely to have conserved gene order between dog and human (NOrth,50 of 48), with Inparanoid (NOrth,50 of 46) and Ensembl (NOrth,50 of 43) orthologues more likely to be found in smaller segments (Table 7). We also wondered whether the “many” orthologues common to Ensembl, Inparanoid, and PhyOP would be more reliable than those of any one alone. We found, however, that PhyOP orthologue predictions that were also shared by the other two approaches had exactly the same NOrth,50 of 48. Lineage-Specific Biology Though duplications appear to have been fixed at a higher rate in the human lineage than in the dog, the functional classes of genes involved are very similar. In both species, genes involved in immunity, chemosensation, and reproduction are prominent (Tables 8 and 9), much as has previously been observed for other mammalian species [3,6,7]. A significant number of dog and human in-paralogues appear to involve independent duplications in both the genomes (46% and 20% of dog and human in-paralogues). It is likely that some of these cases represent gene conversions of more anciently diverged out-paralogues in both species, while others represent true independent duplications arising from common selective pressures on both lineages.
In the main, gene duplications have generated in-paralogues that lie in tandem in the extant genomes. The striking exception to the close physical proximity of in-paralogues is the human-lineage-specific duplication of KRAB-zinc finger (KRAB-ZnF) genes [47]. The ancestral genes, which have been inherited without dispersal in the dog lineage on CFA1, have been duplicated onto twelve human chromosomes (Table 9). Dispersal in the human lineage has not involved retrotransposition as KRAB-ZnF gene structures have been preserved. What then might have caused the unusual dispersal of these genes? One possibility is that these genes lie in sequence that has been especially susceptible to duplication. However, it is also possible that the disruption of physical linkage between in-paralogues might have proved advantageous. This might be because selection on closely linked genes is often less efficient (the Hill-Robertson effect [48]): KRAB-ZnF genes often appear to be under positive selection [6,47]. However, the functions of primate KRAB-ZnF genes remain obscure and the molecular and cellular basis for their proposed adaptive events remain to be determined. Discussion We have presented a new phylogenetic method, PhyOP, which has succeeded in predicting human orthologues for 93% of dog genes. The 1:1 orthologues predicted by PhyOP appear to be more comprehensive and more accurate than those of Inparanoid and Ensembl. The method's major advances, however, are in the predictions of in-paralogues and transcript phylogenies. In-paralogues predicted by PhyOP are more numerous, are less divergent at synonymous sites, and better recapitulate conserved synteny than either Ensembl or Inparanoid. Consistent orthology, including the conservation of intron–exon boundaries, may be useful in detecting mispredicted and nonfunctional genes, and we have identified numerous chimeras and candidates for pseudogenes in the dog and human genomes. Human Gene Duplications Are More Numerous There is a considerable disparity between the numbers of dog and human in-paralogues. It appears that the human lineage has accumulated 3-fold more gene duplicates than has the dog lineage. This may be a result of the lower rates of repeat-mediated segmental duplication in the dog lineage [49] associated with the almost 10-fold lower activity of endogenous retroviral and DNA transposons compared with that in the human [24]. It is also likely that some duplicated genomic regions have been collapsed in the draft assembly of the dog genome. There is, however, an alternative explanation for the larger number of gene duplicates in the human genome assembly: many gene duplicates represented in the human genome assembly may not have been fixed in the population. Rather, they are copy number variants whose appearance in the human genome assembly reflects the mosaicism of the human reference sequence, with contributions from the diverse haplotypes of each of the multiple sequenced individuals. Certainly the majority of duplicates are of recent provenance because their divergences are extremely low [23]. By contrast, the dog genome has been assembled from only a single inbred boxer dog without the incorporation of copy number variants from other dogs. Characteristics of In-Paralogues Most in-paralogues in the dog and human lineages are found in tandem arrays, though human in-paralogues are more likely to have dispersed either to beyond 20 genes from their conserved syntenic gene position on the same chromosome (5%) or else to another chromosome (9%). Again, this may be related to higher rates of repeat-mediated human segmental duplication. In both species, in-paralogues appear to be enriched in genes with few exons. There are overrepresentations of short genes, including single-exonic genes, and those with two or three exons, which are positioned in conserved synteny and thus are unlikely to be nonfunctional retrogenes. The preponderance of short genes would be explained by the relative infrequency of segmental duplications that are sufficiently large to completely encompass sprawling multi-exonic genes, including their 5′ and 3′ regulatory regions [40]. In-paralogues also appear to exhibit higher apparent mutation rates. Estimated dS values between orthologous genes are higher if these have contributed to lineage-specific duplications (Figure 6 The genes that have duplicated in the dog lineage often possess functions in immunity (e.g., α-interferons), chemosensation (e.g., olfactory receptors), and toxin degradation (e.g., cytochrome P450s), categories which are enriched among gene duplications in other mammalian lineages. Nevertheless, the infrequency of dog gene duplications in an evolutionary lineage that has experienced great variation in anatomical morphology indicates that much developmental change may arise not by gene duplication but within the non–protein-coding regulatory segments of the mammalian genome. Alternative Distance Metrics Although, we have used dS values as a proxy for neutral rates in the analysis of the dog and human genomes, PhyOP can also make use of other similar measures. These include divergence of ancestral repeats or of the interiors of introns, which are relatively free of functional constraints [54,55]. We have also shown separately that PhyOP can accurately infer relationships between more divergent genes and species using amino acid–based distances (unpublished data). Conclusion The PhyOP pipeline has provided robust and high-quality orthology and paralogy predictions for the dog and human genomes. However, this approach is also eminently suitable for unravelling the relationships between genes from multiple species simultaneously. Pairwise orthology prediction inaccuracies are additive, and the performance deteriorates with each additional species. Phylogenetic predictions, in contrast, grow more reliable as additional data from each genome allow previous gaps, due, for example, to gene deletions, to be illuminated. Because PhyOP does not use synteny information to predict orthology, it would also be applicable to partially assembled, incompletely sequenced genomes. In the case of the mammalian genomes currently being sequenced at low (~2-fold) statistical coverage [15], only ~86% of the bases in each genome will be covered, leading to many missed genes from each of the sequenced species. This will greatly complicate pairwise assignment of orthology by Ensembl or Inparanoid. PhyOP, using a fully phylogenetic approach to analyse the cohort of genomes simultaneously, should be highly reliable even in the face of missing genes. Materials and Methods Conventions In this article, we indicate the class of orthology relationship by counts of dog orthologues followed by human unless otherwise specified. Thus, a one-to-many relationship refers to a single dog gene that is orthologous to several human-specific duplications. “Genes” in this article always refers to protein-coding genes. Identification of homologues We collated all human and dog peptide sequence predictions from Ensembl (EnsMart version 27.1). Homologues were identified and aligned using BLASTP [12] using an E value upper threshold of 1 × 10−5. Alignments with fewer than 50 aligned residues were discarded. BLAST results are occasionally asymmetric due to heuristic failure. In such cases, we therefore always used the alignment with the higher bit score. Deriving gene phylogenies via transcript phylogenies Our method assigns phylogenetic relationships among all transcripts for two species. From this transcript phylogeny, we reconstructed a second phylogeny, that for genes, to predict their orthology. Partial alignments (such as those between shared domains) can result in sprawling, transitively linked clusters of up to 10,000 transcript sequences. To overcome this, we seeded transcript clusters by single linkage, joining pairwise relationships where the alignments cover at least 60% of the residues in both sequences (Figure 2 dS-Based phylogenies Our simulations using the Evolver programme from the PAML package [56], and codon frequencies derived from the dog and human genomes, show that codeml is able to reliably estimate dS values up to 2.5 (unpublished data). For dS values between 2.5 and roughly 5.0–6.5, codeml is still able to give informative estimates (mean and median values are within 5%), but these are prone to increasingly larger errors due to saturation at synonymous sites. Accordingly, we disregarded all sequence comparisons which resulted in dS > 5.0. In addition, we biased our calculations so that larger dS values more prone to errors were down-weighted. Phylogenies were built from sets of sequences related by reliable dS. These were obtained by clustering sequence pairs first by single linkage and then using a modified version of the UPGMA algorithm (Figure 2 dS-based distance matrices for these sequence sets frequently contain missing values. These can occur when sequences represent gene fragments or alternative transcripts so that some sequence pairs either do not overlap or result in alignments that are too short. However, the majority of missing values in the distance matrix are due to the discarded large dS values (>5.0) representing long branches. Most popular distance-based methods, for example, neighbour-joining and BioNJ, require complete matrices without any missing values. The simulations of Makarenkov and Lapointe [58] show that weighted least-squares algorithms such as the Kitsch or Fitch programmes from the PHYLIP suite of programs [8] are clearly the most effective way to recover underlying phylogenies for incomplete matrices with missing values. Such approaches make it easy to accommodate the rapidly increasing errors as dS approaches saturation at large values. Modifying the Fitch-Margoliash criterion for incomplete distance matrices (with saturated dS values) The least-squares algorithm as implemented in the Kitsch programme in PHYLIP tries to derive rooted phylogenies whose branch lengths are least different from the supplied distance matrix. In other words, the following expression is minimized:
We further modified the algorithm to avoid pernicious “long-branch attraction” errors due specifically to missing values. Pathological tree topologies containing branch joins based only on missing values were down-weighted by assigning an additional large weighting factor (of 10,000) to each occurrence. Given that we selected our initial sets of sequences using hierarchical single-linkage clustering, there can never be subgroups without any dS relationships between them. Trees with pathological joins can never be the globally optimal solution. The additional weighting term allows optima to be found away from these gaps in the optimisation landscape. We used 50 random tree topologies as well as the hierarchical single-linkage cluster as initial starting points for branch and bound search. To avoid redundant searches, a lookup table was used to associate calculated tree scores with the MD5 128-bit hashes of each normalised topology. This greatly speeded up the algorithm and allowed wider ranging analyses of larger tree branches using more modest computing resources. Inferring orthology by congruence with the species tree Orthology and paralogy relationships among the transcripts were inferred automatically by minimising the number of duplications that must be invoked to reconcile the transcript phylogeny with the species tree [60]. In this study, we were only applying the orthology prediction pipeline to sequences from two species, and because the Kitsch programme produces rooted bifurcating trees, the algorithm of Zmasek and Eddy [60] can be greatly simplified. Any node whose two branches each contain only dog sequences and human sequences, respectively, can be mapped to the last common ancestor of the dog and human on the species tree. All dog and human sequences in the clade defined by such a node represent descendants of a single original gene and are hence orthologues (e.g., clades 1 and 2 in Figure 3 Choosing representative transcripts For each gene, we chose a single representative transcript from the phylogenies by applying four heuristics: first, we ruled out all orphaned transcripts outside of orthologue clades (Figure 2 Genes whose representative transcripts are from the same clade were paralogous to each other if they were from the same species, and orthologous if they were from different species. Separating merged genes To recover physically adjacent genes that have been mispredicted as a single merged gene, we searched for genes with multiple nonoverlapping transcripts. As described above, we only considered transcripts in orthologue clades. Thus, considering the canine gene with ten transcripts in Figure 4 Conserved syntenic gene order If gene order was conserved, adjacent orthologues in one species should be neighbours in the other. In many cases, contiguity appeared to have been interrupted by gene insertions in one species (or corresponding losses in the other). We calculated the minimum syntenic distance for a gene as the smallest difference in gene order between neighbours of its orthologues in the other species. This process is illustrated by the example in Figure 10
Pseudogenes Likely pseudogenes were identified by the presence of short introns (less than 10 bp), indicating frameshift or in-frame stop codon disruptions, or by the lack of conserved syntenic gene order in dispersed genes (syntenic distance > 20 genes). We conservatively labelled as a pseudogene any 1) dispersed gene with one or more disruptions, 2) syntenic gene with multiple disruptions, and 3) dispersed single exonic gene. The latter represent mostly retrotransposed pseudogenes (see Results). Some orthologous clades had so many apparent pseudogenes (there were three with more than 70 genes each) spread across the genome that many had an orthologue with a syntenic distance of <20 genes simply by chance. To identify these pseudogenes, we defined widely scattered families as those with members on more than four chromosomes. True orthologues were determined using the criterion of three or more exons with matching exon boundaries in both species. All the members of the orthologue clade with two or fewer exons with nonmatching exon boundaries were labelled as pseudogenes. Orthologous chromosomal segments “Micro-syntenic” blocks of orthologous genes were constructed by grouping together successive genes with conserved gene order and orientation among predicted orthologues in the other species. “Macro-syntenic” blocks were created by concatenating contiguous micro-syntenic blocks that, after rearrangements and inversions, corresponded to a single block of orthologues with conserved gene order in the other species [6]. For orthologous genes in “many” (e.g., one-to-many) relationships, any of the alternative orthologues that conserve gene order and strand qualify the gene for inclusion in a micro-syntenic block. Both types of syntenic blocks are thus directional and species-specific. Ensembl and Inparanoid orthology prediction Ensembl orthologue predictions were obtained from the Compara database (version 27.1). Orthologue sets were created by joining together pairwise orthology predictions using single-linkage clustering. Ensembl orthologues were based on the longest transcripts of each gene [11,14], and where alignment and dS data are given, we have calculated these from the corresponding sequences. Inparanoid version 1.35 was used to predict orthology from BLASTALL alignments results (National Center for Biotechnology Information [NCBI] version 2.2.12) for the longest gene transcripts, as described previously [61]. We used the BLOSUM80 matrix and an X drop-off value of 150. Conservation of exon boundaries The exon loci for each gene were obtained from Ensembl and mapped onto peptide sequences. We identified conserved exon boundaries if they fell within three corresponding codon positions of each other. We ignored exons that fell either outside or at the two ends (distal three codons) of the aligned regions. We also overlooked cases where a single intron was missing in one sequence if the pair included three or more otherwise aligned exons. GO terms GO [62] assignments for all human genes (EnsMart version 27.1) were retrieved from Ensembl. GO terms for dog genes were assigned on the basis of their orthology relationships with human genes. To summarise the large number of overlapping GO terms in Table 3, we used only terms from the overarching GO-SLIM set [63]. This consists of 36 component, 41 function, and 52 biological process terms. The statistical significance of overrepresentations of each GO term among human- and dog-specific paralogues was evaluated using the cumulative hypergeometric distribution with reference to the representation of that term among all orthologues. Only GO terms that are significantly overrepresented (p < 10−4) among pseudogenes and in-paralogues are presented in Tables 3 and 8. Acknowledgments We are indebted to Ensembl, Kerstin Lindblad-Toh, and the Dog Genome Sequencing Consortium for providing the dog genome sequence and gene build, and members of the research group of CPP, particularly Andreas Heger, for advice and helpful discussions. Abbreviations
Footnotes Competing interests. The authors have declared that no competing interests exist. Author contributions. LG conceived of, designed, and implemented PhyOP. CPP provided ideas and advice throughout the project and contributed to the genome analysis. LG and CPP wrote the paper together. Funding. CPP and LG received funding from the UK Medical Research Council. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
||||||||||||||||||||||||||||||||||||||||||||||||||||
Syst Zool. 1970 Jun; 19(2):99-113.
[Syst Zool. 1970]J Mol Biol. 2001 Dec 14; 314(5):1041-52.
[J Mol Biol. 2001]Hum Mol Genet. 2003 Apr 1; 12(7):701-9.
[Hum Mol Genet. 2003]Nature. 2004 Dec 9; 432(7018):695-716.
[Nature. 2004]Nature. 2004 Apr 1; 428(6982):493-521.
[Nature. 2004]J Mol Evol. 1981; 17(6):368-76.
[J Mol Evol. 1981]J Mol Biol. 2001 Dec 14; 314(5):1041-52.
[J Mol Biol. 2001]Nature. 2002 Dec 5; 420(6915):520-62.
[Nature. 2002]Nature. 2004 Apr 1; 428(6982):493-521.
[Nature. 2004]Science. 1997 Oct 24; 278(5338):631-7.
[Science. 1997]Nucleic Acids Res. 2003 Jan 1; 31(1):38-42.
[Nucleic Acids Res. 2003]J Mol Biol. 2001 Dec 14; 314(5):1041-52.
[J Mol Biol. 2001]Nucleic Acids Res. 1997 Sep 1; 25(17):3389-402.
[Nucleic Acids Res. 1997]Nature. 2002 Dec 5; 420(6915):520-62.
[Nature. 2002]Proc Natl Acad Sci U S A. 2005 Mar 29; 102(13):4795-800.
[Proc Natl Acad Sci U S A. 2005]J Mol Biol. 2001 Dec 14; 314(5):1041-52.
[J Mol Biol. 2001]Syst Zool. 1970 Jun; 19(2):99-113.
[Syst Zool. 1970]J Mol Biol. 2001 Dec 14; 314(5):1041-52.
[J Mol Biol. 2001]J Theor Biol. 2006 Jun 21; 240(4):616-26.
[J Theor Biol. 2006]Science. 2000 Oct 13; 290(5490):331-3.
[Science. 2000]Nature. 2002 Dec 5; 420(6915):520-62.
[Nature. 2002]Genome Res. 2003 Jan; 13(1):13-26.
[Genome Res. 2003]Mol Biol Evol. 2004 Oct; 21(10):1820-30.
[Mol Biol Evol. 2004]Nature. 2004 Dec 9; 432(7018):695-716.
[Nature. 2004]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D447-53.
[Nucleic Acids Res. 2005]Nature. 2004 May 27; 429(6990):365-8.
[Nature. 2004]Nature. 2005 Dec 8; 438(7069):803-19.
[Nature. 2005]Nature. 2004 Apr 1; 428(6982):493-521.
[Nature. 2004]Trends Genet. 2005 Feb; 21(2):73-7.
[Trends Genet. 2005]Proc Natl Acad Sci U S A. 2005 Feb 22; 102(8):2850-5.
[Proc Natl Acad Sci U S A. 2005]Genome Res. 2006 Jan; 16(1):30-6.
[Genome Res. 2006]Genome Res. 2006 Jan; 16(1):37-44.
[Genome Res. 2006]Genome Res. 2003 Dec; 13(12):2559-67.
[Genome Res. 2003]Genome Res. 2000 May; 10(5):672-8.
[Genome Res. 2000]BMC Evol Biol. 2004 Jul 6; 4():22.
[BMC Evol Biol. 2004]Science. 2003 Nov 21; 302(5649):1401-4.
[Science. 2003]Mutat Res. 1992 May; 267(1):43-54.
[Mutat Res. 1992]Mol Biol Evol. 2000 Jan; 17(1):68-74.
[Mol Biol Evol. 2000]Genetics. 2003 Dec; 165(4):1793-803.
[Genetics. 2003]Mol Biol Evol. 2005 Jan; 22(1):135-41.
[Mol Biol Evol. 2005]Genome Res. 2005 Jan; 15(1):98-110.
[Genome Res. 2005]Nature. 2002 Dec 5; 420(6915):520-62.
[Nature. 2002]Genomics. 1999 Oct 15; 61(2):145-55.
[Genomics. 1999]Proc Natl Acad Sci U S A. 2003 Apr 29; 100(9):5296-301.
[Proc Natl Acad Sci U S A. 2003]Nature. 2005 Dec 8; 438(7069):803-19.
[Nature. 2005]Genome Res. 1999 Dec; 9(12):1223-30.
[Genome Res. 1999]Genomics. 2002 Mar; 79(3):451-7.
[Genomics. 2002]Genome Biol. 2004; 5(4):R23.
[Genome Biol. 2004]Nucleic Acids Res. 2003 Jan 1; 31(1):38-42.
[Nucleic Acids Res. 2003]J Mol Biol. 2001 Dec 14; 314(5):1041-52.
[J Mol Biol. 2001]Nucleic Acids Res. 1997 Sep 1; 25(17):3389-402.
[Nucleic Acids Res. 1997]Hum Mol Genet. 2003 Apr 1; 12(7):701-9.
[Hum Mol Genet. 2003]Nature. 2002 Dec 5; 420(6915):520-62.
[Nature. 2002]Nature. 2004 Apr 1; 428(6982):493-521.
[Nature. 2004]Genome Res. 2003 Jun; 13(6A):1097-110.
[Genome Res. 2003]Genet Res. 1966 Dec; 8(3):269-94.
[Genet Res. 1966]Nature. 2002 Dec 5; 420(6915):520-62.
[Nature. 2002]Am J Hum Genet. 2003 Oct; 73(4):823-34.
[Am J Hum Genet. 2003]Nature. 2005 Dec 8; 438(7069):803-19.
[Nature. 2005]Nature. 2004 Oct 21; 431(7011):931-45.
[Nature. 2004]Mol Biol Evol. 2005 Jan; 22(1):135-41.
[Mol Biol Evol. 2005]Genome Biol. 2002; 3(2):RESEARCH0008.
[Genome Biol. 2002]Mutat Res. 1992 May; 267(1):43-54.
[Mutat Res. 1992]Mol Biol Evol. 2000 Jan; 17(1):68-74.
[Mol Biol Evol. 2000]Trends Genet. 2003 Feb; 19(2):65-8.
[Trends Genet. 2003]Genetics. 1984 Mar; 106(3):529-48.
[Genetics. 1984]Mol Biol Evol. 2004 Jun; 21(6):1014-23.
[Mol Biol Evol. 2004]Proc Natl Acad Sci U S A. 2005 Mar 29; 102(13):4795-800.
[Proc Natl Acad Sci U S A. 2005]Nucleic Acids Res. 1997 Sep 1; 25(17):3389-402.
[Nucleic Acids Res. 1997]Comput Appl Biosci. 1997 Oct; 13(5):555-6.
[Comput Appl Biosci. 1997]Genetics. 2001 Jan; 157(1):295-305.
[Genetics. 2001]Comput Appl Biosci. 1997 Oct; 13(5):555-6.
[Comput Appl Biosci. 1997]Bioinformatics. 2004 Sep 1; 20(13):2113-21.
[Bioinformatics. 2004]J Mol Evol. 1981; 17(6):368-76.
[J Mol Evol. 1981]Science. 1967 Jan 20; 155(760):279-84.
[Science. 1967]Bioinformatics. 2001 Sep; 17(9):821-8.
[Bioinformatics. 2001]Nature. 2002 Dec 5; 420(6915):520-62.
[Nature. 2002]Nucleic Acids Res. 2003 Jan 1; 31(1):38-42.
[Nucleic Acids Res. 2003]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D476-80.
[Nucleic Acids Res. 2005]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]Genome Res. 2003 Jan; 13(1):103-7.
[Genome Res. 2003]J Hum Genet. 2004; 49(10):548-57.
[J Hum Genet. 2004]Nature. 2002 Dec 5; 420(6915):520-62.
[Nature. 2002]