Nucleic Acids Res. Jan 2009; 37(Database issue): D550–D554.
Published online Nov 16, 2008. doi:  10.1093/nar/gkn859
PMCID: PMC2686504

Génolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes

David J. Sherman,1,2,* Tiphaine Martin,1 Macha Nikolski,1 Cyril Cayla,1 Jean-Luc Souciet,3 and Pascal Durrens1, for the Génolevures Consortium

Abstract

The Génolevures online database (http://cbi.labri.fr/Genolevures/ and http://genolevures.org/) provides exploratory tools and curated data sets relative to nine complete and seven partial genome sequences determined and manually annotated by the Génolevures Consortium, to facilitate comparative genomic studies of Hemiascomycete yeasts. The 2008 update to the Génolevures database provides four new genomes in complete (subtelomere to subtelomere) chromosome sequences, 50 000 protein-coding and tRNA genes, and in silico analyses for each gene element. A key element is a novel classification of conserved multi-species protein families and their use in detecting synteny, gene fusions and other aspects of genome remodeling in evolution. Our purpose is to release high-quality curated data from complete genomes, with a focus on the relations between genes, genomes and proteins.

INTRODUCTION

Since 1999, the Génolevures Consortium has explored eukaryote genome evolution through the large-scale comparison of manually annotated yeast genomes. The Génolevures on-line database undergoes a major update with every significant data release: in 2004 with 13 partial genomes (1,2), in 2006 with four complete genomes (3,4) and in 2008 with four new complete genomes (this work). Our purpose is to release high-quality data sets produced by the 13 partners in the Consortium, rather than to provide a site for integration of data available elsewhere, although we do integrate external data for comparison.

Yeasts are small eukaryotes covering an evolutionary range comparable to the entire Chordate phylum (5), and the unique combination of genetic and genomic tools available for yeasts makes them ideal candidates for experimental study of metabolism, genetic engineering and molecular genetics. All of the many yeasts sequenced so far have small genomes (10–20 Mb), which allows for detailed comparative genomics at a relatively modest price. The economic impact of yeasts is widespread; different species are used for the production of beer, wine and bread and, more recently, of various metabolic products such as vitamins, ethanol, citric acid and lipids. Yeasts can degrade hydrocarbons (genera Candida, Yarrowia and Debaryomyces), metabolize xylose (Pichia stipitis), depolymerise tannin extracts (Zygosaccharomyces rouxii), and produce hormones and vaccines in industrial quantities through heterologous gene expression (6). Several human diseases are caused by yeast species, among them the Hemiascomycetes Candida albicans, Candida glabrata, Candida tropicalis and even Saccharomyces cerevisiae in immunocompromised patients (7). The biology of S. cerevisiae has been extensively studied for decades, both as a model organism for molecular genetics and cell biology and as a cell factory. Its genome (8) is the most thoroughly annotated among eukaryotes, and is a common reference for the annotation of other species. Génolevures focuses on the Hemiascomycete yeasts, a homogeneous phylogenetic group which nonetheless covers a broad range of physiological and ecological lifestyles. Hemiascomycete yeast genes contain introns, and alternative splicing is observed (9). Comparative genomics studies in this group have proven informative (10–16); see (5) for review.

The Génolevures Consortium sequences, annotates and analyzes complete genomes from various branches of the Hemiascomycete class, and subjects them to large-scale in silico and experimental comparisons (Table 1). From these comparisons, we produce classifications of genes, proteins and sequences to address questions of molecular evolution, such as gene conservation, specific genes, function conservation and genome remodeling. We do not provide detailed annotations of individual genes and proteins of S. cerevisiae, which are already carefully maintained by MIPS and CYGD (http://mips.gsf.de/projects/fungi) (17) and SGD (http://www.yeastgenome.org/) (18) as well as in general purpose databases such as UniProt (19) and EMBL (20).

Table 1.
Summary of genomes available in Génolevures online, and for illustration, the number of protein-coding genes classified by a selection of the different analyses made available

NEW GENOMES IN 2008

The Génolevures Consortium released four new genomes in 2008, sequenced and assembled at high coverage and completely annotated by a team of 20 curators (Génolevures Consortium, submitted for publication). These genomes are those of Zygosaccharomyces rouxii, Saccharomyces kluyveri and Kluyveromyces thermotolerans, previously only sequenced to low or partial coverage, and Debaryomyces hansenii, partly resequenced to improve coverage. The first three, plus K. lactis (3) and Ashbya gossypii (21), are members of the Saccharomycetaceae clades that are not descended from the ancestor that is thought to have undergone a whole genome duplication (unlike C. glabrata and S. cerevisiae). The availability in Génolevures of complete sequences for these five unduplicated genomes will allow close study of the core repertoire of protein families and functions and of the dynamics of genome remodeling.

PROTEIN FAMILIES

Extensive map reshuffling and a wide range of GC compositions from one species to another present a challenge for genome comparison in the yeasts. An essential tool for addressing this challenge is provided by ‘protein families’, a classification of protein-coding gene sequences into phylogenetic groups; members of a family are homologous and in many cases this homology is suggestive of functional similarity. The 48 889 proteins in the predicted proteomes of the nine complete genomes are classified into 7927 families, of which 4369 are common to at least two species. Of these latter families, 2591 are common to the nine species. From these families, other sub-classifications are made, such as the identification of syntenic orthologs in the hemiascomycetous yeasts (Génolevures Consortium, submitted for publication).

Families are computed as follows. Four complementary sets of all-against-all alignments are produced by the Blast (22) and Smith–Waterman (23) algorithms, with and without filtering for homeomorphy (common domain architecture along the full length of the proteins) (24). Symmetric distance matrices derived from amino acid identities are constructed and submitted to algorithmic clustering using the MCL ‘Markov clustering’ method (25,26) with a range of statistical parameters. These competing partitions are reconciled using the consensus ensemble clustering method of Ref. (27) and manually curated using both literature search and systematic comparisons with the COG (28) and PIRSF (24) classifications. For a given yeast species, each family may be represented by one or several (paralogous) proteins.
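As an illustration of the clustering step only, the following is a minimal sketch of Markov clustering in NumPy. It assumes that the pairwise identities have already been converted into a symmetric similarity matrix; the function name, the default inflation value and the toy matrix are assumptions made for this example. It is not the MCL program of Refs (25,26), and the consensus reconciliation (27) and manual curation described above are not shown.

```python
import numpy as np

def markov_cluster(similarity, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering of a symmetric, non-negative similarity matrix.

    Returns a sorted list of clusters, each a sorted list of row indices.
    """
    m = np.asarray(similarity, dtype=float)
    m = m + np.eye(m.shape[0])            # self-loops keep every column sum positive
    m = m / m.sum(axis=0, keepdims=True)  # make the matrix column-stochastic
    for _ in range(max_iter):
        prev = m
        m = m @ m                         # expansion: two-step random walks
        m = m ** inflation                # inflation: strengthen strong flows
        m = m / m.sum(axis=0, keepdims=True)
        if np.abs(m - prev).max() < tol:  # stop once the matrix is stable
            break
    # Rows that still carry flow are attractors; the columns they reach
    # are the members of the corresponding cluster.
    clusters = set()
    for row in m:
        members = frozenset(np.flatnonzero(row > 1e-6).tolist())
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Hypothetical similarities for four proteins forming two obvious groups.
toy = [[0.0, 0.9, 0.0, 0.0],
       [0.9, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.8],
       [0.0, 0.0, 0.8, 0.0]]
print(markov_cluster(toy))
```

Running the clustering with several inflation values yields several candidate partitions, which is the situation that a consensus method is meant to resolve.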

Protein families are available in Génolevures at http://genolevures.org/fam/ and tables describing all families are available in the Datasets area. Information on each family is accessible on individual pages presenting a graphical representation of the members’ relationships and links to individual members. (See URL construction rules, below.) Phyletic patterns and phylogenetic profiles are shown for each family; the former uses a simple one-letter code for each species as in Ref. (28) to indicate whether the family contains at least one member from that species, and the latter expands this to a numerical count of members for each species. Complementary information, such as GO terms, multiple sequence alignments and motif analyses, is available for almost all families.
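The difference between a phylogenetic profile and a phyletic pattern can be sketched in a few lines of Python. The species labels, one-letter codes and protein names below are placeholders chosen for this example, not the nomenclature used on the web site.

```python
from collections import Counter

def phylogenetic_profile(members, species_order):
    """Count family members per species.

    members: iterable of (species, protein_id) pairs for one family.
    species_order: species labels in the order the profile should use.
    """
    counts = Counter(species for species, _ in members)
    return [counts.get(species, 0) for species in species_order]

def phyletic_pattern(profile, codes, absent="-"):
    """Collapse a profile to presence/absence using one letter per species."""
    return "".join(code if n > 0 else absent for code, n in zip(codes, profile))

# Hypothetical family with two paralogs in sp1, none in sp2, one in sp3.
family = [("sp1", "protA"), ("sp1", "protB"), ("sp3", "protC")]
order = ["sp1", "sp2", "sp3"]
profile = phylogenetic_profile(family, order)
print(profile, phyletic_pattern(profile, "ABC"))
```

The pattern answers whether a species is represented at all, while the profile keeps the number of paralogs, which is what distinguishes the two views on a family page.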

SYNTENY

Although synteny is in general poorly conserved in the Hemiascomycete yeasts (3,29,30), within phylogenetically delimited groups pairwise coverage is high and can be informative for studying genome dynamics. Pairs of homologous chromosomal regions between two species can be identified by comparing gene content and order, using protein families as an indication of gene-level homology within the regions. We identified these ‘pairwise syntenic blocks’ using the i-ADHoRe method (31) on gene homology relations identified using protein families. The approximately 19 000 pairwise synteny blocks obtained in this fashion can be interactively examined using the genome browser, as described in the Datasets area of the web site, and can be downloaded.
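The idea of using family membership as gene-level homology when comparing gene order can be illustrated with a deliberately simplified sketch. This is not the i-ADHoRe algorithm (31): it chains collinear anchors greedily, in the forward orientation only, and the function name, the `max_gap` and `min_anchors` parameters and the family labels are all assumptions made for the example.

```python
def syntenic_blocks(chrom_a, chrom_b, max_gap=2, min_anchors=3):
    """Greedy chaining of collinear gene pairs that share a protein family.

    chrom_a, chrom_b: family identifiers of the genes, in chromosomal order.
    Returns a list of blocks; each block is a list of (index_a, index_b) anchors.
    """
    # An anchor is a pair of positions whose genes belong to the same family.
    anchors = sorted((i, j)
                     for i, fam_a in enumerate(chrom_a)
                     for j, fam_b in enumerate(chrom_b)
                     if fam_a == fam_b)
    used = set()
    blocks = []
    for start in anchors:
        if start in used:
            continue
        chain = [start]
        used.add(start)
        while True:
            i, j = chain[-1]
            # Unused anchors further along both chromosomes, within the allowed gap.
            candidates = [(p, q) for p, q in anchors
                          if (p, q) not in used
                          and i < p <= i + max_gap + 1
                          and j < q <= j + max_gap + 1]
            if not candidates:
                break
            nearest = min(candidates, key=lambda a: (a[0] - i) + (a[1] - j))
            chain.append(nearest)
            used.add(nearest)
        if len(chain) >= min_anchors:
            blocks.append(chain)
    return blocks

# Hypothetical gene orders, written as family labels.
chrom_a = ["F1", "F2", "F3", "F9", "F4"]
chrom_b = ["F7", "F1", "F2", "F8", "F3", "F4"]
print(syntenic_blocks(chrom_a, chrom_b))
```

A production method also has to handle inverted segments, tandem duplicates and statistical significance of the blocks, none of which this sketch attempts.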

FUSIONS

Chromosomal rearrangements and segmental duplications may result in the creation or destruction of genes, when the breakpoint falls within the boundary of existing genes. ‘Gene fission’ occurs when a gene is broken in this way, and the 5′ (or even the 3′) fragment continues to be transcribed—creating a new, albeit truncated, gene. ‘Gene fusion’ occurs when the chromosomal rearrangement or duplication event leads to a fortuitous combination of previously existing genes, creating a new, longer gene. Fusions and fissions are mechanisms by which new genes can be formed in radical steps that do not obey the tree-like phylogenetic relations between species, and which can lead to significant changes in protein function. These events have to date been rarely taken into account in genome comparisons. Using a new method for detection of genes involved in gene fusion and fission events that makes its computations at the level of paralogous groups (32), we computed a catalogue of such events for complete genomes from the Hemiascomycete yeasts as well as other fungi. Both the paralogous groups of proteins used in the computation, and complete lists of identified fusion/fission events, are now available in Génolevures and are downloadable in the Datasets area.

OTHER DATA SETS

Tandem gene repeats are another means of gene formation. Unlike gene copies resulting from segmental duplications or retrotransposons, truncated or chimeric genes are not observed at the boundaries of tandem gene arrays. Data on tandem gene repeats are available in the Datasets area.

The Génosplicing database of spliceosomal introns and intron motifs in Hemiascomycete yeasts (C. Neuvéglise, unpublished data) is available from the Datasets area, and can be used both for the study of splicing patterns and for the development of methods for predicting gene architectures from genomic data.

The YETI classification of yeast membrane and transporter proteins (33), defined using evolutionary relationships traced with non-ambiguous functional and phylogenetic criteria derived from the TCDB (34) classification system, is available in the Datasets area.

Other analyses, such as the coverage of KEGG pathways (36) considered in Ref. (35), are available.

EXPLORING GÉNOLEVURES DATA

The design of the Génolevures on-line database has been revamped to provide improved tools for gaining insight into the mechanisms of eukaryotic molecular evolution. The key questions in the Génolevures use cases are:

  1. What genes exist, as orthologs for my favorite gene or as members of a functional class (keyword search, alignment, homologs)?
  2. What is known about a given chromosomal element (chromosomal elements)?
  3. What relations exist in a protein family (protein families)?
  4. How are the individual genomes organized (maps and genome browser, synteny)?
  5. How are genes and proteins classified (data sets for fusions, tandem repeats, introns, pathways, YETI)?

Additionally, the web site provides a query system that simultaneously searches for and can return: genes that have or may have a translation product, RNA and other genes that may have a transcription product only, cis-active elements and cross-genome protein families.

AVAILABILITY OF DATA AND URL CONSTRUCTION RULES

All data from the Génolevures web site are freely available, and instructions for proper citation are included in each section. The Génolevures web site is developed using a ‘representational state transfer’ architecture (37) and URLs for individual identified resources built from the database can be constructed systematically. Identified resources include chromosomal elements, such as genes (prefix/elt/Abbrev/Element_identifier), protein families (prefix/fam/Family_identifier), DNA sequences (prefix/seq/Sequence_identifier) and biomolecular pathways (prefix/pathway/Pathway_identifier). Nomenclatures for genome abbreviations, systematic chromosomal element names, and protein family names are described in the Documentation area of the web site.
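The construction rules above can be captured in a small helper. The sketch below assumes that ‘prefix’ stands for the site address http://genolevures.org given in the abstract; the function name and the identifiers in the usage lines are placeholders, and real identifiers should follow the nomenclature in the Documentation area.

```python
from urllib.parse import quote

# Assumption: 'prefix' in the rules above is the site address from the abstract.
DEFAULT_PREFIX = "http://genolevures.org"

# Resource kinds named in the text and the number of identifiers each one takes.
RESOURCE_KINDS = {
    "elt": 2,      # prefix/elt/Abbrev/Element_identifier
    "fam": 1,      # prefix/fam/Family_identifier
    "seq": 1,      # prefix/seq/Sequence_identifier
    "pathway": 1,  # prefix/pathway/Pathway_identifier
}

def resource_url(kind, *identifiers, prefix=DEFAULT_PREFIX):
    """Build the URL of an identified resource from its kind and identifiers."""
    if kind not in RESOURCE_KINDS:
        raise ValueError(f"unknown resource kind: {kind!r}")
    if len(identifiers) != RESOURCE_KINDS[kind]:
        raise ValueError(f"{kind!r} expects {RESOURCE_KINDS[kind]} identifier(s)")
    # Percent-encode each identifier so that it forms exactly one path segment.
    path = "/".join(quote(str(part), safe="") for part in identifiers)
    return f"{prefix.rstrip('/')}/{kind}/{path}"

# Placeholder identifiers, for illustration only.
print(resource_url("elt", "ABBREV", "ELEMENT_ID"))
print(resource_url("fam", "FAMILY_ID"))
```

Because each resource has a single systematic address, links to genes, families, sequences and pathways can be generated by external tools without querying the site first.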

Génolevures uses a bespoke object model mapped to a relational database, and uses the SOFA (38) and GO (39) ontologies extensively. The web interface is developed in Mason. Key software components are provided by the NCBI Blast program suite (22) and the Stein Lab's Generic Genome Browser (40).

ONGOING DEVELOPMENTS

The Génolevures Consortium continues its effort to sequence and annotate other complete genomes of Hemiascomycete yeasts. These genomes as well as the incorporation of other genomes annotated by third parties will help to refine the classifications presented above.

SUPPLEMENTARY DATA

Supplementary Data are available online at http://genolevures.org/.

FUNDING

French National Center for Scientific Research (CNRS) (GDR 2354, partial); French National Research Agency (ANR) (ANR-05-BLAN-0331; GENARISE, partial); Région Aquitaine (‘Pôle de Recherche en Informatique’) (2005-1306001AB, partial) and ACI IMPBIO (IMPB114, partial) (‘Génolevures En Ligne’). Funding for open access charge: CNRS GDR 2354.

Conflict of interest statement. Large-scale Smith-Waterman alignments used in constructing protein families were performed using GenCore 6 software licensed from Biocceleration Inc., 77 Milltown Road, Suite B4, East Brunswick, NJ 08816, USA.

APPENDIX

The Génolevures Consortium is coordinated by J.L.Souciet and is composed of laboratories from Institut de Biologie et de Technologies de Saclay (iBiTec-S), CEA, F-91191 Gif-sur-Yvette CEDEX, France (C.Marck); Laboratoire de Chimie Bactérienne, CNRS-UPR9043, Université de la Méditerranée, 31 Chemin Joseph Aiguier 13402 Marseille Cedex 20 (E.Talla); Laboratoire Microbiologie, Adaptation et Pathogénie, Université Claude Bernard - Lyon 1, CNRS/UMR 5240, 43 Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France (M. Lemaire); Génoscope (CEA), 2 rue Gaston Crémieux, BP 191, F-91057 Evry Cedex, France (J. Weissenbach); Laboratoire Bordelais de Recherche en Informatique, LaBRI, UMR 5800 CNRS, INRIA Bordeaux Sud-Ouest (MAGNOME), 351 cours de la Libération, 33405 Talence Cedex, France (D.Sherman); Laboratoire de Microbiologie et Genetique Moleculaire, UMR 1238 INRA/2585 CNRS/Institut National Agronomique Paris-Grignon, AgroParisTech, F-78850 Thiverval-Grignon, France (C.Gaillardin); Institut Pasteur/URA2171 CNRS and Université Pierre et Marie Curie UFR927, 25 rue du Docteur Roux, F-75724 Paris Cedex 15, France (B. Dujon); Laboratoire Génétique Moléculaire, Génomique, Microbiologie, UMR 7156, Université Louis Pasteur, Institut de Botanique, 28 rue Goethe, F-67000 Strasbourg, France (S. Potier, J.-L. Souciet); Washington University School of Medicine, Department of Genetics, Campus Box 8510 4566 Scott Ave., St Louis, MO 63110 USA (M. Johnston); Washington University School of Medicine, Genome Sequencing Center, Campus Box 8501, 4444 Forest Park Avenue, St Louis, MO 63108 USA (R. Fulton); Pasteur Genopole: Intégration et Analyse Génomique, 28 rue du Docteur Roux, F-75724 Paris Cedex 15, France (L.Frangeul); Unité de Génétique, Université Catholique de Louvain, Croix du Sud 2 bte 14, 1348 Louvain-la-Neuve, Belgique (P.Baret); Institut de Biologie Moléculaire et Cellulaire, Architecture et Réactivité de L'ARN, IBMC (UPR 9002 CNRS), 15 rue René Descartes, 67084 Strasbourg, France (E.Westhof).

REFERENCES

1. Souciet JL, Génolevures Consortium. Special issue: Génolevures. FEBS Lett. 2000;487:1–149. [PubMed]
2. Sherman DJ, Durrens P, Beyne E, Nikolski M, Souciet JL. Genolevures: comparative genomics and molecular evolution of hemiascomycetous yeasts. Nucleic Acids Res. 2004;32:D315–D318. [PMC free article] [PubMed]
3. Dujon B, Sherman DJ, Fischer G, Durrens P, Casaregola S, Lafontaine I, De Montigny J, Marck C, Neuvéglise C, Talla E, et al. Genome evolution in yeasts. Nature. 2004;430:35–44. [PubMed]
4. Sherman D, Durrens P, Iragne F, Beyne E, Nikolski M, Souciet JL. Genolevures complete genomes provide data and tools for comparative genomics of hemiascomycetous yeasts. Nucleic Acids Res. 2006;34:D432–D435. [PMC free article] [PubMed]
5. Dujon B. Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Trends Genet. 2006;22:375–387. [PubMed]
6. Kurtzmann CP, Fell JW, editors. The Yeasts, A Taxonomic Study. 4th edn. Amsterdam: Elsevier; 1998.
7. Tawfik OW, Papasian CJ, Dixon AY, Potter LM. Saccharomyces cerevisiae pneumonia in a patient with acquired immune deficiency syndrome. J. Clin. Microbiol. 1989;27:1689–1691. [PMC free article] [PubMed]
8. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, et al. Life with 6000 genes. Science. 1996;274(546):563–567. [PubMed]
9. Neuveglise C, Chalvet F, Wincker P, Gaillardin C, Casaregola S. Mutator-like element in the yeast Yarrowia lipolytica displays multiple alternative splicings. Eukaryot. Cell. 2005;4:615–624. [PMC free article] [PubMed]
10. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science. 2003;301:71–76. [PubMed]
11. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003;423:241–254. [PubMed]
12. Richard GF, Kerrest A, Lafontaine I, Dujon B. Mol. Biol. Evol. 2005;22:1011–1023. [PubMed]
13. Fabre E, Muller H, Therizols P, Lafontaine I, Dujon B, Fairhead C. Comparative genomics of hemiascomycete yeasts: genes involved in DNA replication, repair, and recombination. Mol. Biol. Evol. 2005;22:856–873. [PubMed]
14. Knop M. Evolution of the hemiascomycete yeasts: on life styles and the importance of inbreeding. Bioessays. 2006;28:696–708. [PubMed]
15. Rokas A, Carroll SB. More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol. Biol. Evol. 2005;22:1337–1344. [PubMed]
16. Blank LM, Lehmbeck F, Sauer U. Metabolic-flux and network analysis in fourteen hemiascomycetous yeasts. FEMS Yeast Res. 2005;5:545–558. [PubMed]
17. Mewes HW, Dietmann S, Frishman D, Gregory R, Mannhaupt G, Mayer K, Muensterkötter M, Ruepp A, Spannagl M, Stuempflen V, et al. MIPS: analysis and annotation of genome information in 2007. Nucleic Acids Res. 2008;36:D196–D201. [PMC free article] [PubMed]
18. Nash R, Weng S, Hitz B, Balakrishnan R, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, et al. Expanded protein information at SGD: new pages and proteome browser. Nucleic Acids Res. 2007;35:D468–D471. [PMC free article] [PubMed]
19. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al. The Universal Protein Resource (UniProt) Nucleic Acids Res. 2005;33:D154–D159. [PMC free article] [PubMed]
20. Kanz C, Aldebert P, Althorpe N, Baker W, Baldwin A, Bates K, Browne P, van den Broek A, Castro M, Cochrane G, et al. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 2005;33:D29–D33. [PMC free article] [PubMed]
21. Dietrich FS, Voegeli S, Brachat S, Lerch A, Gates K, Steiner S, Mohr C, Pohlmann R, Luedi P, Choi S, et al. The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome. Science. 2004;304:304–307. [PubMed]
22. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
23. Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197. [PubMed]
24. Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka CR, Hu ZZ, Mazumder R, Kumar S, Kourtesis P, Ledley RS, et al. PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res. 2004;32:D201–D205. [PMC free article] [PubMed]
25. Van Dongen S. PhD Thesis. The Netherlands: University of Utrecht; 2000. Graph clustering by flow simulation.
26. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–1584. [PMC free article] [PubMed]
27. Nikolski M, Sherman DJ. Family relationships: should consensus reign?–consensus clustering for protein families. Bioinformatics. 2007;23:e71–e76. [PubMed]
28. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41. [PMC free article] [PubMed]
29. Llorente B, Malpertuy A, Neuveglise C, de Montigny J, Aigle M, Artiguenave F, Blandin G, Bolotin-Fukuhara M, Bon E, Brottier P, et al. Genomic exploration of the hemiascomycetous yeasts: 18. Comparative analysis of chromosome maps and synteny with Saccharomyces cerevisiae. FEBS Lett. 2000;487:101–112. [PubMed]
30. Fischer G, Rocha EP, Brunet F, Vergassola M, Dujon B. Highly variable rates of genome rearrangements between hemiascomycetous yeast lineages. PLoS Genet. 2006;2:e32. [PMC free article] [PubMed]
31. Simillion C, Vandepoele K, Saeys Y, Van de Peer Y. Building genomic profiles for uncovering segmental homology in the twilight zone. Genome Res. 2004;14:1095–1106. [PMC free article] [PubMed]
32. Durrens P, Nikolski M, Sherman DJ. Fusion and fission of genes define a metric between fungal genomes. PLOS Comput. Biol. 2008;4:e1000200. [PMC free article] [PubMed]
33. De Hertogh B, Hancy F, Goffeau A, Baret PV. Emergence of species-specific transporters during evolution of the hemiascomycete phylum. Genetics. 2005;172:771–781. [PMC free article] [PubMed]
34. Saier MH. A functional-phylogenetic classification system for transmembrane solute transporters. Microbiol. Mol. Bio. Rev. 2000;64:354–411. [PMC free article] [PubMed]
35. Iragne F, Nikolski M, Sherman D. Extrapolation of metabolic pathways as an aid to modelling completely sequenced nonSaccharomyces yeasts. FEMS Yeast Res. 2008;8:132–139. [PubMed]
36. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. [PMC free article] [PubMed]
37. Fielding R, Taylor RN. Principled design of the modern web architecture. ACM Trans. Internet Techn. 2002;2:115–150.
38. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005;6:R44. [PMC free article] [PubMed]
39. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 2000;25:25–29. [PMC free article] [PubMed]
40. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S. The Generic Genome Browser: a building block for a model organism system database. Genome Res. 2002;12:1599–1610. [PMC free article] [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press