Copyright © 2004 Naumoff et al; licensee BioMed Central Ltd.

Retrieving sequences of enzymes experimentally characterized but erroneously annotated: the case of the putrescine carbamoyltransferase

1 Institut de Génétique et Microbiologie, CNRS UMR 8621, Université Paris Sud, Bâtiment 409, 91405 Orsay Cedex, France
2 Microbiology, Free University of Brussels (VUB) and J.M. Wiame Research Institute, 1, ave E. Gryzon, B-1070, Brussels, Belgium
3 State Institute for Genetics and Selection of Industrial Microorganisms, I-Dorozhny proezd, 1, Moscow 117545, Russia

Corresponding author.
Daniil G Naumoff: daniil_naumoff/at/yahoo.com; Ying Xu: xuyingbelgium/at/yahoo.com; Nicolas Glansdorff: nglansdo/at/vub.ac.be; Bernard Labedan: bernard.labedan/at/igmors.u-psud.fr

Received April 8, 2004; Accepted August 2, 2004.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background
Annotating genomes remains a hazardous task. Mistakes or gaps in such a complex process may occur when relevant knowledge is ignored, whether lost, forgotten or overlooked. This paper exemplifies an approach which could help to resuscitate such meaningful data.

Results
We show that a set of closely related sequences which have been annotated as ornithine carbamoyltransferases are actually putrescine carbamoyltransferases.
This demonstration is based on the following points: (i) use of enzymatic data which had been overlooked, (ii) rediscovery of a short NH2-terminal sequence which allowed us to reannotate a wrongly annotated ornithine carbamoyltransferase as a putrescine carbamoyltransferase, (iii) identification of conserved motifs which distinguish unambiguously between the two kinds of carbamoyltransferases, and (iv) comparative study of the gene context of these different sequences.

Conclusions
We explain why this specific case of misannotation had not yet been described and draw attention to the fact that analogous instances must be rather frequent. We urge special caution when high sequence similarity is coupled with an apparent lack of biochemical information. Moreover, from the point of view of genome annotation, proteins which have been studied experimentally but are not correlated with sequence data in current databases qualify as "orphans", just as unassigned genomic open reading frames do. The strategy we used in this paper to bridge such gaps in knowledge could work whenever it is possible to collect a body of facts about experimental data, homology, unnoticed sequence data, and accurate information about gene context.

Background

As a consequence of the deluge of completely sequenced genomes belonging to a large array of species, one can expect to identify many homologues of enzymes which have previously been well studied at the experimental level. This seems to be the general rule, and the public sequence databanks (DDBJ/EMBL/GenBank) are now inundated with putative amino acid sequences which have been annotated solely by the widely used two-step process: (1) detection of a homologous relationship by a pairwise sequence similarity search at the level of primary structure and (2) inference of functional similarity from this detected homology. However, the opposite might be true.
For various reasons one can either miss or misinterpret the actual function of a putative protein when annotating by homology, resulting in a wrong function transfer. Several studies have already emphasized this point (see, for example, [1-3]). On the other hand, besides these now well identified errors, which are often due to automatic processes, more subtle mistakes may occur when some of the numerous effects of divergent evolution are overlooked. In particular, one of the insufficiently appreciated problems of functional assignment is that homologous proteins might catalyse different biochemical reactions.

Here, we discuss an instance of erroneous annotation (misannotation) in genes of nitrogen metabolism which to our knowledge has not yet been brought up. We explain why this is so and draw attention to the fact that similar cases must actually be rather frequent.

Results and Discussion

Annotating distant carbamoyltransferases

Our group ([4,5]) is presently involved in deciphering the evolutionary relationships between two ubiquitous and essential proteins: aspartate carbamoyltransferase (ATCase, EC 2.1.3.2), which catalyses the first committed step of de novo pyrimidine biosynthesis, and ornithine carbamoyltransferase (OTCase, EC 2.1.3.3), which plays a crucial role in both anabolism and catabolism of arginine. In a recent study of the phylogeny of the 245 available OTCases (paper in preparation), we confirmed the existence of two families, OTC alpha and OTC beta, previously proposed on the basis of phylogenetic studies [4]. However, the advent of many new sequences further led to a more complex topology of the distance tree schematized in Fig. 1.
The YgeW protein encoded by Escherichia coli and its close homologue from Clostridium botulinum are both located on a long branch emerging at the base of this OTCase tree. YgeW is annotated as belonging to the ATCase/OTCase family (see, for example, the SwissProt knowledgebase [6]). On the branch which is next to the root we find the sequence of a protein which has been reported to be essential for arginine biosynthesis in the anaerobic bacterium Bacteroides fragilis [7]. This protein has been crystallized and characterized as a carbamoyltransferase-like protein since it does not display OTCase activity in vitro [7]. Indeed, several of its residues have been substituted at sites which are viewed as crucial for OTCase activity. Moreover, Dashuang et al. [7] indicated that a similar protein has been found in Xylella fastidiosa. Our phylogenetic data are in agreement with this observation, since the proteins annotated as OTCases in two strains of X. fastidiosa and their close relatives present in two species of the Xanthomonas genus are found to branch close to that of B. fragilis. Therefore, the functional identification of these different UTC is certainly not straightforward and requires further investigation.

Furthermore, it occurred to us that, more than thirty years ago, another carbamoyltransferase was discovered by Roon and Barker [8]. A putrescine carbamoyltransferase (PTCase, EC 2.1.3.6) was found to be synthesized by the Gram-positive bacterium Streptococcus (now Enterococcus) faecalis when it was grown on agmatine but not arginine as primary energy source. This PTCase was easily separated from the OTCase synthesized by the same organism grown on arginine [8]. This putrescine carbamoyltransferase was further studied by V. Stalon's group ([9-12]). Two features of this study, which had apparently been overlooked in recent genome annotations, now appear to be crucial for the interpretation of the data shown in Fig. 1.

A family of putrescine carbamoyltransferases

In a second step, we extended this reannotation of a wrong OTCase as a PTCase to six other sequences encoded by Lactococcus lactis, Streptococcus mutans, Pediococcus pentosaceus, Lactobacillus brevis (and a very close partial sequence in Lactobacillus sakei), Listeria monocytogenes and Mycoplasma mycoides, respectively. Indeed, these eight sequences, which have been annotated as either ArgF or ArcB (Table 1), (i) share high identity at the level of their amino acid sequence; (ii) form a monophyletic group (Fig. 1)
Gene context, another tool for gene reannotation

In a third step, the reannotation of this clade of OTCase sequences as PTCases was confirmed by a comparative study of the neighbourhood of their encoding genes in the four genomes that have been completely sequenced and published (E. faecalis, Lc. lactis, S. mutans and L. monocytogenes). As shown in Fig. 3,
Thus, these four clusters of genes appear to encode the full set of enzymes which are expected to form the catabolic agmatine deiminase pathway [10]. Agmatine deiminase, PTCase and carbamate kinase were already known to be coinduced by agmatine in E. faecalis when it is used as sole energy source [10], strongly suggesting that these gene clusters are functional operons. In Pseudomonas aeruginosa PAO1, the homologous agmatine deiminase is encoded by the gene aguA, which belongs to an operon, aguBA, induced by agmatine and N-carbamoylputrescine ([16,17]). In this species, however, N-carbamoylputrescine is converted by an N-carbamoylputrescine amidohydrolase (EC 3.5.1.53, the aguB product) into putrescine and CO2 + ammonium rather than into putrescine and carbamoylphosphate. More recently, a similar pathway for polyamine biosynthesis has been identified by homology in higher plants [18]. In the alternative pathway corresponding to the analogous sets of genes shown in Fig. 3,

Furthermore, when we compare the clusters of genes shown in Fig. 3

Conclusions

Genome annotation requires both reliable tools for identifying gene function and manual expertise. The frustration due to the high percentage of orphan genes found in all genomes is often compounded by another, more vicious, problem which may occur when a very strong sequence similarity obscures the actual functional identity of another kind of orphan. The analysis described in this paper illustrates the difficulty in identifying such a potential source of misannotation and delineates at least two fundamental parameters which must be considered, especially when the results appear to be straightforward.

First, one must keep in mind that proteins sharing a high level of identical residues may have different functions. A routine step for challenging the functional annotation of any putative coding sequence should be a phylogenetic analysis.
Any CDS found to branch far from its homologues in an evolutionary tree, as observed in the case of the carbamoyltransferases (Fig. 1),

The second parameter which must be considered is the striking lack of information in the various public databases. For example, in the case studied here (the putrescine carbamoyltransferase, EC 2.1.3.6), it is reported that there is no sequence available in various first-rate databases specialized in enzymatic and/or metabolic data, such as ENZYME [23], BRENDA [24], KEGG [25], BIOCYC [26], etc., as well as in the Gene Ontology (GO:0050231) Consortium [27]. A significant part of this deficit of information appears to be due to a failure to correlate the biochemical data previously published [8-11], and well recorded in BRENDA [24], for example, with the incomplete amino acid sequence, which was not taken into account although it had been published by the same group [13] that studied this enzyme.

The specific point we would like to stress in this paper reflects a more general, and widely ignored, gap between the enormous quantity of information buried in the sequence data and the refined knowledge built up over several decades of studies on gene regulation and protein biochemistry (recorded in [23] to [27]). In this respect, experimentally studied proteins not correlated with sequence data also qualify as "orphans". In the present case, such a gap in knowledge could be bridged only because we used the experimental approach detailed in this paper. After being alerted by the unusual topology of the phylogenetic tree (Fig. 1)

The strategy we used to identify such orphan sequences could work in any other case where it is possible to collect a body of facts about experimental data, homology, unnoticed sequence data, and accurate information about gene context. Note that we incidentally used such a strategy to annotate the genes encoding a putative agmatine deiminase in the genomes listed in Fig. 3.

Methods

Collecting sequences

Nearly 450 carbamoyltransferase (ATCase and OTCase) sequences were collected from the public databases SwissProt, TREMBL and TREMBLNEW [15]. To facilitate the management of these data, which are continuously growing with the arrival of newly completed genome sequences, we assembled them in a relational database (available on request). Moreover, in the case of unpublished but completely sequenced genomes, it was often possible to recover bona fide sequences from the sites of specific sequencing groups (Joint Genome Institute [29] and Sanger Institute [30]) using either BlastP or tBlastN queries. We retained only unpublished sequences aligning along their whole length with bona fide carbamoyltransferases and sharing no less than 30% identity with them, using at least two distantly related seeds.

Reconstructing phylogenetic trees

Rooted phylogenetic trees were derived from multiple alignments of ATCases and OTCases using two different approaches.

(1) New sequences were manually added and aligned to the previously published [4] multiple alignment using the BioEdit sequence alignment editor [31]. These additions were made easier by introducing each new sequence near its closest partner (the first hit in a routine BlastP check). This stepwise approach minimized the risk of introducing any bias when adding numerous new sequences. Nevertheless, the soundness of this manual alignment was routinely checked using automatic programmes (both ClustalX and DARWIN, see below) to verify that we did not miss any conserved motifs. We further validated this multiple alignment (especially the introduction of gaps) by using the information available from the known 3D structures of ATCases and OTCases. Maximum parsimony and distance trees were derived from this alignment using the PROTPARS and NEIGHBOR programmes of the PHYLIP package [32], respectively.
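As an illustration of the retention rule stated under Collecting sequences, the following minimal Python sketch computes percent identity and coverage from gapped pairwise alignments and keeps a candidate only if it passes against at least two seeds. It is not the procedure actually run for this study. The function names, the 0.9 coverage cut-off standing in for "aligning along their whole length", and the choice to compute identity over columns where both sequences carry a residue are assumptions made for the example; only the 30% identity threshold and the two-seed requirement come from the text, and reading the latter as "must pass against two or more seeds" is itself an interpretation.

```python
from typing import Iterable, Tuple

GAP = "-"


def identity_and_coverage(aln_candidate: str, aln_seed: str) -> Tuple[float, float]:
    """Return (percent identity, candidate coverage) for one gapped pairwise alignment.

    Identity is counted over columns where both sequences have a residue
    (an assumption; other conventions exist). Coverage is the fraction of
    candidate residues that face a seed residue rather than a gap.
    """
    if len(aln_candidate) != len(aln_seed):
        raise ValueError("aligned strings must have equal length")
    paired = 0
    identical = 0
    for c, s in zip(aln_candidate, aln_seed):
        if c != GAP and s != GAP:
            paired += 1
            if c.upper() == s.upper():
                identical += 1
    candidate_len = sum(1 for c in aln_candidate if c != GAP)
    if paired == 0 or candidate_len == 0:
        return 0.0, 0.0
    return 100.0 * identical / paired, paired / candidate_len


def retain_candidate(
    alignments: Iterable[Tuple[str, str]],
    min_identity: float = 30.0,   # threshold quoted in the text
    min_coverage: float = 0.9,    # hypothetical stand-in for "whole length"
    min_seeds: int = 2,           # "at least two distantly related seeds"
) -> bool:
    """Keep a candidate if enough (candidate, seed) alignments pass both cut-offs."""
    passing = 0
    for aln_candidate, aln_seed in alignments:
        ident, cov = identity_and_coverage(aln_candidate, aln_seed)
        if ident >= min_identity and cov >= min_coverage:
            passing += 1
    return passing >= min_seeds
```

Each tuple passed to `retain_candidate` is one pairwise alignment of the same candidate against a different seed, for example as parsed from BlastP output; the toy strings used to exercise the functions are not real carbamoyltransferase sequences.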
The PHYLIP package was further used to derive confidence limits for each node of either parsimony or distance trees using a bootstrap approach (programmes SEQBOOT and CONSENSE).

(2) The PhyloTree programme of the DARWIN package [33] was used to build a multiple alignment and to derive a distance tree which is an approximation of a maximum likelihood tree, since the deduced evolutionary distances are weighted by computing their variance when reconstructing the tree.

Authors' contributions

NG dug up the "ancient" data on putrescine carbamoyltransferase, contributed his knowledge about carbamoyltransferases and made important additions to the manuscript. YX brought essential information about the genetics and biochemistry of the enzymes involved in arginine metabolism and their evolution. DGN participated in the collection of new carbamoyltransferase sequences and their manual alignment and identified which sequence of E. faecalis is the putrescine carbamoyltransferase. BL carried out the phylogenetic analyses and the gene context study, and drafted the manuscript, which was further improved (and approved) by all authors.

Acknowledgements

We thank the DOE (Department of Energy, USA) and the Wellcome Trust (United Kingdom) for making available unpublished sequences from genomic projects produced by different Sequencing Groups at either the JGI [29] or the Sanger Institute [30]. This work was supported by the Flanders Foundation for Joint and Fundamental Research and by the CNRS (UMR 8621). Daniil Naumoff was supported by a postdoctoral grant from the French Ministère de la Recherche.