Copyright © 2007 Craxton; licensee BioMed Central Ltd.

Evolutionary genomics of plant genes encoding N-terminal-TM-C2 domain proteins and the similar FAM62 genes and synaptotagmin genes of metazoans

Molly Craxton

Medical Research Council Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, UK

Corresponding author. Molly Craxton: molly/at/mrc-lmb.cam.ac.uk

Received November 22, 2006; Accepted July 31, 2007.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Synaptotagmin genes are found in animal genomes and are known to function in the nervous system. Genes with a similar domain architecture as well as sequence similarity to synaptotagmin C2 domains have also been found in plant genomes. The plant genes share an additional region of sequence similarity with a group of animal genes named FAM62. FAM62 genes also have a similar domain architecture. Little is known about the functions of the plant genes and animal FAM62 genes. Indeed, many members of the large and diverse Syt gene family await functional characterization. Understanding the evolutionary relationships among these genes will help to realize the full implications of functional studies and lead to improved genome annotation.

Results

I collected and compared plant Syt-like sequences from the primary nucleotide sequence databases at NCBI. The collection comprises six groups of plant genes conserved in embryophytes: NTMC2Type1 to NTMC2Type6. I collected and compared metazoan FAM62 sequences and identified some similar sequences from other eukaryotic lineages. I found evidence of RNA editing and alternative splicing. I compared the intron patterns of Syt genes. I also compared Rabphilin and Doc2 genes.
Conclusion

Genes encoding proteins with N-terminal-transmembrane-C2 domain architectures resembling synaptotagmins are widespread in eukaryotes. A collection of these genes is presented here. The collection provides a resource for studies of intron evolution. I have classified the collection into homologous gene families according to distinctive patterns of sequence conservation and intron position. The evolutionary histories of these gene families are traceable through the appearance of family members in different eukaryotic lineages. Assuming an intron-rich eukaryotic ancestor, the conserved intron patterns distinctive of individual gene families indicate independent origins of Syt, FAM62 and NTMC2 genes. Resemblances among these large, multi-domain proteins are due not only to shared ancestry (homology) but also to convergent evolution (analogy). During the evolution of these gene families, duplications and other gene rearrangements affecting domain composition have occurred along with sequence divergence, leading to complex family relationships with accordingly complex functional implications. The functional homologies and analogies among these genes remain to be established empirically.

Background

Synaptotagmins (Syts) share a common structure: an N-terminal transmembrane (TM) sequence followed by a variable-length linker and two tandem, distinctly conserved C2 domains, C2A and C2B. Syt1 [1], identified as a protein component of synaptic vesicles, is known to be required for nervous system function, acting crucially in the fast, synchronous component of calcium-regulated synaptic vesicle exocytosis [2]. Genomic analysis of Syt genes [3,4] indicates that animal genomes encode diverse sets of Syt genes but always maintain a Syt1 orthologue. Although it is likely that Syt1 orthologues function similarly [2,5-8], the functions of the other Syt genes, in different species, remain to be established.
The complexity of this task increases with the number of Syt genes, which in turn increases with organism complexity. The first study of the full set of Syt genes in a model organism [9] indicated that only Syt1 is expressed on synaptic vesicles. The other Syt genes were found to be expressed in different and distinct places. Many studies using different mammalian Syt genes indicate tissue distributions which are primarily neural, e.g. [[10,11] and references therein]. Naturally occurring, cell type-specific expression patterns have, however, rarely been described, e.g. [[7,9,12,13] and references therein].

The discovery of genes in plants which are similar to Syt genes [3,4,14] further complicates functional predictions. While the plant genes and another group of animal genes (FAM62) share similarity with Syt genes, little is known about their functions. A preliminary biochemical analysis of proteins from the human FAM62 gene family has just been published [15], but growing speculation about the plant genes [16-18] necessitates a more detailed description of their similarities and differences, which could usefully inform future functional studies. I have made use of the abundance of recently deposited nucleotide sequences from a wide range of organisms to carry out a comparative genomics analysis of these genes, in order to shed light on their evolutionary relationships.

Results

Collection of plant gene sequences

In order to undertake a comparative analysis of the plant Syt-like genes, I collected and compared full-length homologues from an evolutionary range of plants. In order to perform an unbiased search for as many homologues of these relatively unknown genes as possible, I looked at all of the primary nucleotide sequence data in the NCBI sequence databases [19]. This information is fragmentary, little of it being in the form of complete sequences, either of transcripts or genomes.
By far the most abundant source of new plant sequences is ESTs, but these represent particularly small fragments and their sequences are not determined to high accuracy. I therefore needed to gather sets of overlapping ESTs to find full-length gene sequences. In order to focus the search on the detection of genuinely homologous sequences, I used nucleotide sequence probes of plant sequences already identified. Only those database sequences closely related to the probe sequence would be identified in a given search. These matching sequences were added to the collection and joined to any overlapping sequences already present in the collection. Reiterated searches served to expand the collection and extend the length of gene fragments. Had I used amino acid sequence probes to search for homologues of these genes, I would have detected a wider range of fragments with amino acid similarity, but these would not necessarily be homologous. Overlapping nucleotide sequences would be required in any case to piece together whole genes from the identified EST fragments, so the simplest strategy to gather full-length relatives of these genes was to use nucleotide probes.

I avoided gathering processed sequences in the sequence databases: these include genes predicted from genome annotation pipelines, as well as the vast majority of amino acid sequences, which are predicted from nucleotide sequences. These sequences may not be accurate and could mislead subsequent analyses if used without verification. I therefore carried out reiterated rounds of blastn searching of nucleotide sequences at NCBI [19]. In the first few rounds, I used probes representing the plant gene coding sequences I had already identified (genes 85 to 117) [4]. After each round, I collected all of the statistically significant hits with high scoring segments longer than 30 nucleotides and assembled these sequences into a gap4 database [20].
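The hit-selection rule described here (statistically significant hits with high scoring segments longer than 30 nucleotides) could be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used: it assumes the blastn results are available in the standard 12-column tabular format, and the E-value cutoff of 1e-5 is a hypothetical stand-in for "statistically significant", which the text does not quantify. All function names and the example records are invented for the sketch.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    length: int    # length of the aligned segment (HSP), in nucleotides
    evalue: float


def parse_tabular_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse BLAST tabular output: 12 tab-separated columns per HSP."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hits.append(Hit(fields[0], fields[1], int(fields[3]), float(fields[10])))
    return hits


def select_hits(hits: List[Hit], min_length: int = 30,
                max_evalue: float = 1e-5) -> List[Hit]:
    """Keep HSPs longer than min_length whose E-value is at most max_evalue.

    max_evalue is an assumed cutoff; the source gives no numeric threshold.
    """
    return [h for h in hits if h.length > min_length and h.evalue <= max_evalue]


# Invented example records: one good hit, one too short, one not significant.
example = [
    "probe85\tEST_A\t92.0\t120\t9\t0\t1\t120\t5\t124\t1e-40\t180",
    "probe85\tEST_B\t100.0\t25\t0\t0\t1\t25\t10\t34\t1e-6\t50",
    "probe85\tEST_C\t80.0\t60\t12\t0\t1\t60\t3\t62\t0.5\t30",
]
kept = select_hits(parse_tabular_hits(example))
print([h.subject for h in kept])
```

Only hits passing both filters would then be passed on to assembly; the length filter guards against short spurious matches, while the significance filter guards against long but weak ones.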
Repeated searching with different probes, followed by gap4 assembly of only previously uncollected hits, allowed me to gradually but efficiently build a comprehensive collection. Each probe detected a unique spectrum of homologous plant sequences. Probes from a given species could be used to find similar sequences from related species. Probes covering more conserved regions could be used to find sequences from a wider range of relatives. Sequences from closely related species could be used to bridge non-overlapping contigs from a single species. In the later stages of the collection process, I carefully separated the contigs so that in most cases, each represents a set of overlapping sequences from one species only. As a final step, to ensure that the collection was as comprehensive as it could be at this time, I searched the nucleotide sequences at NCBI using tblastn with amino acid sequence probes and confirmed that the top scoring hits had already been collected.

As well as examining transcript sequences, I also collected genomic sequences where available. I particularly wanted to examine the genome of Physcomitrella patens, which is currently being sequenced [21]. I had previously identified Syt-like genes in the genome sequences of Arabidopsis thaliana and Oryza sativa, but both of these represent relatively recently evolved angiosperms whereas the moss genome represents an ancient bryophyte. I used the trace archive at NCBI [22] as well as resources at PHYSCObase [23], where transcript sequences are also available. I confirmed the genomic and transcript sequences from several Physcomitrella patens gene loci and deposited these sequences in the public databases [EMBL:AM140045, EMBL:AM140046, EMBL:AM140047, EMBL:AM140048, EMBL:AM140049, EMBL:AM140050].
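The reiterated search strategy described in this section, in which each round uses newly found sequences as probes and only previously uncollected hits are added, amounts to iterating until no new sequences appear. The sketch below illustrates that loop in outline only. The `search` callable is a hypothetical placeholder for a database search that returns identifiers of matching sequences; the toy dictionary stands in for a sequence database, and the assembly of overlapping hits into contigs is not modelled.

```python
from typing import Callable, Iterable, Set


def collect_homologues(seed_probes: Iterable[str],
                       search: Callable[[str], Iterable[str]],
                       max_rounds: int = 20) -> Set[str]:
    """Expand a collection by repeated searching until nothing new is found.

    search(probe) is a hypothetical stand-in for one database search.
    """
    collected = set(seed_probes)
    probes = sorted(collected)
    for _ in range(max_rounds):
        new_hits = set()
        for probe in probes:
            # Keep only hits that are not yet in the collection.
            new_hits.update(h for h in search(probe) if h not in collected)
        if not new_hits:
            break  # the collection has stopped growing
        collected |= new_hits
        probes = sorted(new_hits)  # next round: probe with the new sequences
    return collected


# Toy "database": each identifier maps to the identifiers it would detect.
toy_db = {
    "gene85": ["est1", "est2"],
    "est1": ["est3"],
    "est2": ["est1"],
    "est3": [],
}
result = collect_homologues(["gene85"], lambda probe: toy_db.get(probe, []))
print(sorted(result))
```

In this toy run the seed probe finds two fragments, one of which leads to a third, after which a further round finds nothing new and the loop stops.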
In contrast to animal Syt genes, which appear to increase in number along with organism complexity [4], I found that the haploid genome of Physcomitrella patens has even more of these plant genes (19 or more) than either Oryza sativa (13) or Arabidopsis thaliana (11). Additional file 1 lists full details of each gene identified. Additional file 2 lists alphabetically, in rough phylogenetic order, all of the plant species in which genes in this collection have been identified. Genes were identified in a wide evolutionary range of land plants, from bryophytes to rosids.

Analysis of full-length plant genes

Database searching identified six distinct groups of plant genes. Since all of the genes encode relatively long proteins, most of the collection comprises gene fragments which cannot yet be extended to full length. Only where a large number of overlapping sequences were available was it possible to derive full-length gene sequences from EST contigs. Consequently, the full-length sequences represent the relatively abundantly transcribed, or the shorter, genes. Genomic sequences were useful for identifying full-length sequences, irrespective of transcript abundance, as well as for providing the intron-exon structure of the gene. Full-length amino acid sequences were compared using Multalin [24]. The previously used nomenclature (SytA, SytB, SytC etc.) following [14] is somewhat arbitrary and is inadequate for a consistent and meaningful description of these plant genes. I propose the following naming convention for these plant N-terminal-TM-C2 domain genes: NTMC2Type1.1, NTMC2Type1.2, NTMC2Type6 and so on. Multiple alignments of full-length sequences from each group are presented in the figures, beginning with figure 1.

Collection of animal FAM62 genes

I had previously identified genes in metazoans and non-metazoans which encode N-terminal-TM-C2 domain proteins sharing similarity with those of plants [4]. In the meantime, with the annotation of the human genome, the three members of this gene family in Homo sapiens have been named FAM62A, FAM62B and FAM62C [25]. I sought to identify homologues of these genes in other organisms by tblastn searching genomic sequences, thereby identifying full-length genes and their intron-exon structures. In contrast to the current status of primary nucleotide sequences from plants, many more animal genomic sequences are available to search. One reason for this is that animal genomes are relatively small in comparison to plant genomes and are therefore relatively less expensive to sequence. After identifying FAM62 gene homologues in genomic sequences, I searched transcript sequences using blastn with nucleotide probes, to confirm the predicted gene structures. I identified FAM62 homologues in a range of metazoan genomes. Details of each gene are listed in additional file 3.

Analysis of full-length FAM62 genes

Full-length amino acid sequences were compared using Multalin [24].

[Figure 7]
[Figure 9]
Analysis of the structure of Syt genes

Collection and analysis of the plant NTMC2 genes and animal FAM62 genes revealed intron patterns which are highly conserved within the different groups, implying a long evolutionary history for the whole length of each gene. I have previously looked at the intron patterns of Syt genes and found strong conservation of particular intron positions [3,4]. To make clear the differences between the plant and animal N-terminal-TM-C2 domain genes and Syt genes, which are also N-terminal-TM-C2 domain genes, I analyzed the intron positions within the coding regions of Syt genes from a wide range of metazoans. Details of Syt genes shown here but not previously reported [4] are in additional file 4.

[Figure 11]
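Comparing intron positions between genes requires placing each intron on a common coordinate system, such as the columns of a multiple amino acid alignment. The sketch below shows one way this could be done and is not the method used in the paper. It assumes each intron is recorded as a residue index in the ungapped protein plus an intron phase (0, 1 or 2); the phase field, the function names and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Intron = Tuple[int, int]  # (0-based residue index, phase); assumed representation


def residue_to_column(aligned_seq: str, residue_index: int) -> int:
    """Map a 0-based residue index of the ungapped sequence to its alignment column."""
    seen = -1
    for column, char in enumerate(aligned_seq):
        if char != "-":
            seen += 1
            if seen == residue_index:
                return column
    raise ValueError("residue index lies beyond the end of the sequence")


def shared_intron_positions(alignment: Dict[str, str],
                            introns: Dict[str, List[Intron]],
                            min_genes: int) -> List[Intron]:
    """Return (column, phase) pairs occupied by an intron in at least min_genes genes."""
    counts: Counter = Counter()
    for gene, aligned_seq in alignment.items():
        # A set ensures each gene contributes at most once per position.
        positions = {(residue_to_column(aligned_seq, index), phase)
                     for index, phase in introns.get(gene, [])}
        counts.update(positions)
    return sorted(pos for pos, n in counts.items() if n >= min_genes)


# Toy alignment of three invented genes; '-' marks an alignment gap.
alignment = {"geneA": "MK-TLV", "geneB": "MKATLV", "geneC": "M--TLV"}
introns = {"geneA": [(2, 0)], "geneB": [(3, 0)], "geneC": [(1, 0), (2, 1)]}
conserved = shared_intron_positions(alignment, introns, min_genes=3)
print(conserved)
```

In the toy data the three genes record the shared intron at different residue indices, yet all three map to the same alignment column, which is why the comparison is made in alignment coordinates rather than per-gene coordinates.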
Discussion

I have examined groups of genes in plants and animals which encode N-terminal TMs followed by a linker and one or more C2 domains. The NTMC2 genes and the FAM62 genes share sequence similarity in the linker region between the N-terminus and the first C2 domain. This region has recently been identified as a conserved domain of unknown function named SMP [30]. The NTMC2 genes have one or two C2 domains and the FAM62 genes have three or more C2 domains. The plant genes and the animal genes each have modular gene structures with conserved intron positions.

[Figure 13]
FAM62-like genes are identifiable in yeasts and fungi, but their more divergent sequences and general lack of introns set them apart from the group of metazoan FAM62 genes, and I have not analysed them here. I have identified similar genes in other non-metazoans, such as Trypanosoma brucei, Ostreococcus tauri and Cyanidioschyzon merolae, but these too are quite divergent and lack introns (details in additional file 5). All of the full-length nucleotide sequences in this paper are listed in additional file 6. All of the full-length amino acid sequences in this paper are listed in additional file 7.

The NTMC2Type1, NTMC2Type2 and NTMC2Type3 genes are Syt-like, in that they have an N-terminal TM and two separately conserved C2 domains. Their conserved intron patterns distinguish them from Syt genes, which have only been found in metazoans and have their own distinctive intron patterns. The NTMC2Type1, NTMC2Type2 and NTMC2Type4 genes are highly similar up to the first C2 domain, indicating a possible gene fusion or fission. A gene fission event is apparent in the genes encoding Doc2 and Rabphilin proteins (figure 12).

The difficulty of applying a consistent and meaningful gene nomenclature is highlighted by this work. In the past, gene naming was usually the result of slow and painstaking research. Genes were given names indicating a phenotype or functional aspect of an expressed product. Now, in the genome era, vast numbers of genes are appearing at great speed. To make sense of all this new information, evolutionary genomics [29] aims to dissect the complex relationships between genes in different life forms over evolutionary time scales, thereby improving genome annotation. Genes can express multiple functional products and be regulated differently in different contexts. This means that it cannot be straightforward to predict the functional consequences of variations at particular genomic loci, in different species or even different individuals.
Functional annotation of genomes is therefore not a straightforward task. There is already confusion with Syt nomenclature (see for example SYT5, Syt5, SYT9 and Syt9 in the Gene and Pubmed databases at NCBI). Equivalent genomic loci in different species can be given different names through separate genome annotation pipelines, and the community of researchers engaged in functional studies of the gene products continues to supply yet more names relating to the particular functions they have studied (for example, see [15]).

In this paper I have named the Sytα genes, which lack human homologues, in line with [9]. I have named those with human homologues according to the HUGO gene nomenclature committee approved human gene names [25]. Three Syt genes in Caenorhabditis elegans remain unclassified at present and I have simply numbered them (1) to (3) for now. The Wormbase [31] nomenclature for Caenorhabditis elegans Syt genes, snt-1 to snt-6, does not (apart from snt-1 being numbered consistently with its relationship to other Syt1 genes) yet take account of their evolutionary relationships. Flybase [32] Syt gene names are currently restricted to three of the seven Syt genes in Drosophila melanogaster: Syt1, 4 and 7 (yet see [33], where four Syt genes were identified in Drosophila melanogaster, but only two of these match Flybase Syt genes, likely due to inaccuracies in the source databases used). While the Homo sapiens and Mus musculus genes encoding Rabphilin have now been named RPH3A and Rph3a, respectively, the genes encoding Doc2 proteins have not yet acquired genome nomenclature committee approved names. I named the FAM62 genes in this paper according to the HUGO gene nomenclature committee approved names, but these names have no functional meaning. I suggest a nomenclature for the plant genes which describes their domain composition. This may have some functional relevance.
For the future annotation of genomes with homologues of the genes discussed here, it would be useful to incorporate these gene predictions into the sequence databases such that they are obviously visible and appropriately connected. This should be possible via the recently introduced Third Party Annotation (TPA) facility at the NCBI and EMBL nucleotide sequence databases. Genome annotation needs to be updated continuously and the information from separate genome projects integrated. A possible wiki solution to the problem of updating genome annotation has recently been proposed [34].

Conclusion

A comparative genomics analysis of genes with N-terminal-TM-C2 domain architectures helps to understand how these genes have evolved. Although it is not possible to draw firm conclusions about the total gene complement of organisms from incomplete genome sequences, such information is needed for sound inferences about the origin and diversification of gene families. The examination of a wide variety of fragmentary sequences does, however, provide much information, useful for understanding the evolution of both genes and their functional products. Large scale, structure-based comparisons of protein sequences inform functional perspectives on the evolution of protein repertoires, e.g. [[35-37] and references therein]. A structural analysis of eukaryotic C2 domain proteins [38] has considered the evolution of this particular domain. For more gene-oriented perspectives, see e.g. [29,39,40] and for a consideration of non-coding sequence evolution, see e.g. [41,42].

The collection of genes used here includes evolutionarily widely dispersed genes with distinctive intron-exon patterns. It includes several gene families with long evolutionary histories. The origins of these gene families are not yet clear but appear to be several. Genome sequences from more lineages of simple, deep-branching eukaryotes may, in future, reveal the earlier histories of these gene families.
The collection demonstrates different modes of gene evolution: the C2 domain duplication of FAM62A genes, the whole gene duplication of the Tribolium castaneum FAM62 genes and Mus musculus Doc2 genes, the alternative exons of the C2-1 domain encoded by insect FAM62 genes, the gene fusion/fission of NTMC2Type2/NTMC2Type4 and Rabphilin/Doc2 genes, and the expansion and diversification of the Syt gene family. Intron gains and losses are also demonstrated. Intron movements in the duplicated Tribolium castaneum FAM62 genes and intron movement with functional consequences in the NTMC2Type2 genes are interesting examples. The mechanisms of intron gain and loss and the causes of intron evolution are matters of considerable debate [39,43]. This gene collection provides some useful information for this area of investigation.

Different gene products in this collection share a domain architecture which implies membrane proteins tethered by TM domains which, via their C2 domains, interact with lipids, other membranes and other proteins, sometimes in a calcium-regulated manner. Functional studies on many of these genes have yet to be undertaken. It remains to be seen exactly what levels of functional equivalence exist even between different members of the same gene family, for example, the Syt gene family. An empirical approach to investigating the functions of plant NTMC2 genes and animal FAM62 genes would therefore seem wiser than attempting to make functional predictions based on their shared structural domains, which are not homologous. Improved understanding of the evolutionary relationships among these genes will help to guide and interpret future functional studies as well as informing the effort to annotate genome sequences.
I hope that innovations in gene and genome annotation will in future allow the easy integration of new results from functional studies and that new functional studies can likewise be informed by evolutionary considerations based on good annotation. Complex eukaryotic genes are difficult to predict accurately from genome sequences and need to be verified by comparison with transcript sequences. This is especially important when subtle gene regulation by alternative splicing and RNA editing is involved. Ideally, in time, it will be possible to integrate all sources of data into a comprehensible resource.

Methods

Cloning and sequencing of Physcomitrella patens genes

Physcomitrella patens genomic DNA was a gift from Didier Schaefer. I used this as a template for PCR reactions. I amplified genomic regions using Pfu turbo polymerase with phosphorylated primers and cloned the products into Sma digested pBSIIKS-. After sequencing, overlapping clones were selected and digested with restriction enzymes in such a way as to ligate the genomic locus into one piece. The sequence of each genomic clone was deposited in the public sequence databases [EMBL:AM410046, EMBL:AM410049, EMBL:AM410050]. cDNA clones, also gifts from Didier Schaefer, were obtained from the M. Hasebe collection [44] at PHYSCObase [23] and sequenced completely. These sequences were deposited in the public sequence databases [EMBL:AM410045, EMBL:AM410047, EMBL:AM410048].

Acknowledgements

I wish to thank Didier Schaefer and Boris Voigt for their gifts of plant DNAs.