Copyright © 2007 by the Genetics Society of America

Why Are There Still Over 1000 Uncharacterized Yeast Genes?

Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada

1Corresponding author: Banting and Best Department of Medical Research, 160 College St., Room 1302, University of Toronto, Toronto, ON M5S 3E1, Canada. E-mail: t.hughes/at/utoronto.

Communicating editor: A. Spradling

Abstract

The yeast genetics community has embraced genomic biology, and there is a general understanding that obtaining a full encyclopedia of functions of the ~6000 genes is a worthwhile goal. The yeast literature comprises over 40,000 research papers, and the number of yeast researchers exceeds the number of genes. There are mutated and tagged alleles for virtually every gene, and hundreds of high-throughput data sets and computational analyses have been described. Why, then, are there >1000 genes still listed as uncharacterized on the Saccharomyces Genome Database, 10 years after sequencing the genome of this powerful model organism? Examination of the currently uncharacterized gene set suggests that while some are small or newly discovered, the vast majority were evident from the initial genome sequence. Most are present in multiple genomics data sets, which may provide clues to function. In addition, roughly half contain recognizable protein domains, and many of these suggest specific metabolic activities. Notably, the uncharacterized gene set is highly enriched for genes whose only homologs are in other fungi. Achieving a full catalog of yeast gene functions may require a greater focus on the life of yeast outside the laboratory.

SEVERAL years ago we projected that, judging from the steady and predictable increase in database entries, all yeast genes would have “known” functions by mid-2007 (Hughes et al. 2004). 
This was an optimistic estimate, since what was known at the time about many of the “known” genes was unsatisfying (or even conflicting) in terms of actual understanding of physiological purpose or molecular role. Nonetheless, it seemed reasonable, and still does, that yeast will be the first organism for which the functions of all the genes are characterized to a degree that would satisfy most molecular biologists. Characterizing the functions of all of the yeast genes is not a senseless academic pursuit. Yeast is a part of everyday life, being critical to baking and brewing. Yeasts can also be pathogenic. Much of what is known about how all eukaryotic cells work has come from studying yeast. And the possibility that something would be known about virtually all yeast genes only a decade or so after the initial sequencing [at which time only ~1000 genes appeared in the literature, and roughly half could be ascribed some function on the basis of sequence features (Goffeau et al. 1996)] could be taken as a harbinger of what might eventually be anticipated from similar efforts in more complicated organisms with larger genomes and more genes, such as humans. Examination of the path taken to such an achievement, including any hurdles or missteps, should provide a level of guidance. Although many organisms now have a genome sequence and a set of useful genetic tools, yeast has been the major proving ground for large-scale application of new technologies, and in many respects remains the advance guard of functional genomics, proteomics, and systems biology, as outlined in other recent reviews (Bader et al. 2003; Mustacchi et al. 2006; Suter et al. 2006). Knowing what all the parts do is important if you want to know how a machine works. It is now mid-2007. Are the functions of all 6000 yeast genes “known”? Given a very liberal definition of known function, this goal has nearly been reached. The commercial database used in our initial time line (Hughes et al. 
2004) is no longer freely available. We therefore examined information present in the Saccharomyces Genome Database (SGD) (Nash et al. 2007), the public database that curates the yeast genes and also compiles Gene Ontology (GO) annotations, publications, interactions, and a host of additional information. As of March 20, 2007, there are only 38 genes with no information available whatsoever, and only 566 lack any annotations in any of the three major branches of Gene Ontology (biological process, molecular function, or cellular component), a somewhat fluid categorical data structure that incorporates information with a variety of evidence levels (Ashburner et al. 2000). This is, to be sure, a phenomenal achievement, having occurred mostly in a single decade. Unfortunately, much of this annotation is derived from large-scale studies, and depth is lacking for many genes. In the last 3 years, the overall proportion of genes that SGD classifies as “uncharacterized ORF” has not changed a great deal, decreasing by only 285 genes (Figure 1).
As of March 20, 2007, there are still 1253 uncharacterized yeast genes listed on SGD—21% of all genes. Why so many? What might their functions be? How will we go about characterizing them? Previous analyses have outlined the relative strengths of specific approaches in functional genomics and proteomics, usually using cross-validation (i.e., sensitivity and specificity at reproducing the genes whose functions are already known in blinded tests) (von Mering et al. 2002; Wong et al. 2005; Myers et al. 2006). All methods work in at least some instances and, in selected cases, have been extremely successful. RNA and ribosome biogenesis categories, for example, which include hundreds of genes, often dominate functional predictions, since these genes and their encoded proteins share regulatory patterns, exist in relatively abundant protein complexes, and often contain conserved sequence features. Consequently, investigators have capitalized on large-scale data sets to study new RNA and ribosome biogenesis genes and their products (e.g., Grandi et al. 2002; Peng et al. 2003). It is less clear how useful predictions from large-scale data are in all classes of gene function, however. Our previous analysis (Hughes et al. 2004) included predicted GO annotations for 122 proteins that were uncharacterized at the time, but whose GO assignments were supported by three or more large-scale data sets. While it is encouraging that 82 of these have since been characterized, only 23 were assigned to one of the precise predicted annotation categories, and 12 of these were in RNA processing or ribosome biogenesis. This is certainly higher than random, but cautions that predictions might best be used as a rough guide to areas for further exploration. 
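The follow-up of the 2004 predictions is easier to weigh as proportions. The short Python sketch below only restates the four counts quoted above; the variable names and the `pct` helper are ours and are not part of the original analysis.

```python
# Counts quoted in the text; names are illustrative, not from the original analysis.
predicted = 122            # uncharacterized proteins with GO predictions in 2004
since_characterized = 82   # of those, characterized by March 2007
precise_match = 23         # characterized genes that fell in a predicted category
rna_ribosome = 12          # precise matches in RNA processing / ribosome biogenesis


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


summary = {
    "characterized_of_predicted": pct(since_characterized, predicted),
    "precise_of_predicted": pct(precise_match, predicted),
    "precise_of_characterized": pct(precise_match, since_characterized),
    "rna_ribosome_of_precise": pct(rna_ribosome, precise_match),
}

for label, value in summary.items():
    print(f"{label}: {value}%")
```

Read this way, about two-thirds of the predicted genes have since been characterized, but fewer than one in five of the original predictions landed in a precisely predicted category, and roughly half of those precise matches fall in the RNA and ribosome biogenesis categories that already dominate functional predictions.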
Thus, rather than evaluating the relative merits of the methods, here we instead consider the uncharacterized genes themselves, with an eye to why they are not yet known, how their functions might be understood, and what lessons might hold for similar efforts in other genomes.

ARE ALL OF THE UNCHARACTERIZED GENES REAL?

A trivial explanation for difficulty in associating uncharacterized genes with specific functions would be that they are not real genes. The yeast gene catalog changes occasionally as gene boundaries are redefined and small expressed ORFs are discovered (Fisk et al. 2006). Expression or even localization (discussed below), which are often taken as evidence that a gene is real, is not on its own conclusive proof of function. Even mutant phenotypes are not foolproof, as they can be caused by disruption of regulatory sequence. Anecdotally, microarray analysis of yeast ORF deletion strains occasionally reveals misregulation of an adjacent gene on the chromosome (T. R. Hughes, unpublished observation), which could, in principle, have a phenotypic consequence that is unrelated to the function of the deleted gene. These caveats partially underlie the general notion that gene characterization should be based on more than one independent line of evidence. Perhaps the best evidence that most of the uncharacterized yeast genes are bona fide genes comes from sequence attributes. The vast majority are conserved in syntenic positions in related yeasts, indicating selective pressure for retention (Cliften et al. 2003; Kellis et al. 2003) (Figure 2).
ARE THE UNCHARACTERIZED GENES TOO NEW TO HAVE BEEN STUDIED?

Another explanation for lack of functional knowledge on many yeast genes is that they have only recently been discovered. Systematic yeast gene names (e.g., YAL001W) added after the initial sequence assembly (Goffeau et al. 1996) typically carry a suffix (e.g., YAL001W-A) (Fisk et al. 2006). A total of 204 of the uncharacterized genes carry such a suffix, while only 285 genes overall carry such a suffix [not including “dubious ORFs,” which are no longer believed to be genes (Fisk et al. 2006)]. Not being in the database clearly puts a gene at a disadvantage for characterization: 188 of the 204 uncharacterized genes carrying a suffix were not in the initial deletions consortium collection (Giaever et al. 2002), which has formed the basis of hundreds of genetic screens over the last 5 years; similarly, 156 are not in the TAP/GFP collections (Ghaemmaghami et al. 2003; Huh et al. 2003). However, on the whole, most of the uncharacterized genes have been in databases for over a decade. The majority (982) are present in the initial deletions consortium collection and many others have been added subsequently (Kastenmayer et al. 2006) (Figure 2).

DO UNCHARACTERIZED GENES HAVE ANY DISTINGUISHING CHARACTERISTICS IN LARGE-SCALE ANALYSES?

Uncharacterized genes tend to be underrepresented in virtually all types of large-scale data (Hughes et al. 2004), and this may partially explain why they are not studied as frequently; however, we found no criteria that completely separate currently uncharacterized from characterized genes. One of the greatest differences in distribution of properties is requirement for viability: <4% of uncharacterized genes (40) are listed as essential, in comparison to ~19% of genes on the whole (Fisk et al. 2006). In the GFP localization database (Huh et al. 
2003), there are no striking differences in distribution among subcellular categories, and while 77% of “verified” ORFs are localized (3491/4505), in comparison to 57% of uncharacterized ORFs (629/1094), this is at least partially due to the use of localization information in gene characterization since 2003. The difference in the distribution of expression levels between characterized and uncharacterized genes is significant, although very far from a bimodal distribution (Figure 3).
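To give a sense of the size of the localization gap, the sketch below applies a pooled two-proportion z-test to the quoted counts (3491/4505 verified and 629/1094 uncharacterized ORFs). The choice of test is our assumption, not a method described in the text, and the function name is hypothetical. The proportions are computed from the counts, which give about 77% and 57%.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z-test.

    Returns (proportion_a, proportion_b, z, two_sided_p_value).
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_a, p_b, z, p_value


# Localized ORFs in the GFP collection, as quoted in the text.
p_verified, p_unchar, z, p_value = two_proportion_z(3491, 4505, 629, 1094)
print(f"verified localized:        {p_verified:.3f}")
print(f"uncharacterized localized: {p_unchar:.3f}")
print(f"z = {z:.1f}, two-sided p = {p_value:.2e}")
```

A test of this kind shows only that the gap is unlikely to be sampling noise. It says nothing about cause, and the circularity noted above (localization data feeding into the "verified" label) still applies.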
ARE THE UNCHARACTERIZED GENES NEEDED ONLY UNDER CERTAIN CONDITIONS?

Accepting that the current set of uncharacterized genes are in large part real genes, that most of them are present in current reagent sets (including the deletions collection), and that many are expressed under standard laboratory conditions and participate somehow in the life of the cell, we are faced with the possibility that there is something about their biology that makes them difficult to characterize. That they tend to lack strong phenotypes indicates that many of them are not needed under standard laboratory conditions. Two general explanations are (1) that the uncharacterized genes are redundant with other genes or with each other or (2) that they are important only under conditions that are not usually assayed in the laboratory. These explanations are related, are not mutually exclusive, and one or both may apply to only a subset of genes. Are these explanations supported by data?

Redundancy carries several expectations. First, since one mechanism for achieving redundancy is gene duplication, we might expect the uncharacterized ORFs to contain many duplicated genes. In fact, 161 uncharacterized proteins have sequences at least 50% identical to the sequences of another uncharacterized protein (median blast E-value 10^−51), and these 161 proteins cluster into 54 groups (Figure 4).
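One simple way to turn pairwise identities into groups of this kind is single-linkage clustering: any two proteins at or above the identity threshold are linked, and each connected set forms a group. The text does not say which grouping procedure was used, so the Python sketch below is an assumption. The ORF names and identity values in the toy input are invented for illustration.

```python
def cluster_by_identity(pairs, min_identity=50.0):
    """Single-linkage clustering with a union-find structure.

    `pairs` is an iterable of (protein_a, protein_b, percent_identity).
    Proteins with no partner at or above the threshold are left out.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, identity in pairs:
        if identity >= min_identity:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for protein in list(parent):
        groups.setdefault(find(protein), set()).add(protein)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical pairwise hits: (query, subject, percent identity).
toy_pairs = [
    ("ORF_A", "ORF_B", 62.0),
    ("ORF_B", "ORF_C", 55.0),
    ("ORF_D", "ORF_E", 71.0),
    ("ORF_E", "ORF_F", 30.0),  # below threshold, so ORF_F joins no group
]
groups = cluster_by_identity(toy_pairs)
print(groups)
```

Single linkage tends to chain: ORF_A and ORF_C end up in one group through ORF_B even if they are not themselves 50% identical. The number of groups obtained from real data therefore depends on this choice as well as on the threshold.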
Second, redundancy might result in a substantial number of genetic interactions with and among uncharacterized genes. Genetic redundancy with >100 genes has been measured systematically across all strains in the deletion collection (Tong et al. 2004). In the BioGrid database (Stark et al. 2006), only 197 (or 18%) of uncharacterized genes have at least one genetic interaction, in comparison to 3275 (or 72%) of verified ORFs. This is far from conclusive evidence, as only a small proportion of the full genetic interaction network has been mapped (even under a single growth condition) and relatively few uncharacterized genes have been crossed to all other deletions. In general, it is more difficult to understand the mechanistic basis of genetic interactions than it is for physical interactions, and there is still some debate regarding expectation and scoring of double-mutant crosses. Nonetheless, this result does raise the possibility that at least some of the uncharacterized genes exist solely as genetic buffers, but also suggests that high genetic redundancy does not distinguish them from characterized genes. If the uncharacterized genes are important only under specific conditions—particular situations that are not normally tested in the laboratory—then we would expect mutation of these genes to have little or no phenotype in general, although they may be constitutively expressed. Results obtained to date do roughly match this expectation. Another expectation of genes required to survive specific situations in nature is that their sequence features may be indicative of activities related to interaction with the environment. Many of them do contain sequence features indicating that they are biosynthetic enzymes (147) or transporters/permeases (40), consistent with a role in dealing with biochemistry in the wild. 
Consistent with this notion, the uncharacterized gene set contains an extremely significant number of genes (177 of 405) that have a homolog in one or more other fungi, but not in any other organisms (Nishida 2006). The set of 228 characterized genes with exclusively fungal homologs is enriched for yeast-specific functions, including genes involved in cell-wall organization and biogenesis, mating, and cell–cell adhesion (Robinson et al. 2002).

HOW WILL WE COMPLETE THE ENCYCLOPEDIA OF THE CELL?

To summarize the evidence, a variety of factors are likely to contribute to the relatively large number of as-yet-uncharacterized yeast genes, including genetic redundancy, lack of strong growth phenotype, and the possibility that not all of them are real genes. In addition, many of the remaining genes may be involved in environmental and metabolic responses or growth modes not normally queried in the laboratory. Thus, despite the success of high-throughput methods in many instances, characterization of these remaining refractory genes may require a different tack.

Certainly there is no shortage of researchers and enthusiasm. On the basis of first-author names on articles cataloged at SGD, there were at least 9447 yeast researchers active between 2003 and 2007. Yeast molecular biologists typically jump at the chance to describe a new gene function, especially if it involves a topic of current interest; for example, RTT109, although previously known to influence Ty transposition (Scholes et al. 2001), was simultaneously described as encoding a histone H3-K56 acetyltransferase by at least five different groups in late 2006 and early 2007 (Schneider et al. 2006; Collins et al. 2007; Driscoll et al. 2007; Han et al. 2007; Tsubota et al. 2007). 
Indeed, it does appear that present approaches to gene characterization result in a gradual increase in the proportion of genes with known functions (Figure 1).

First, a great deal might be accomplished simply by having seasoned yeast researchers systematically peruse the list of uncharacterized genes and their attributes (including what is known or predicted about them from sequence features or large-scale surveys) and encouraging those engaged in large-scale efforts to focus on the uncharacterized genes.

Figure 5
Given the large number of apparent biosynthetic enzymes and mitochondrial proteins, a global metabolite profiling system would be extremely helpful. Such a system has not yet entered into the mainstream, although the technology exists (Griffin 2006). Nonetheless, biochemical genomics (Martzen et al. 1999), in which genes are characterized by the activities of their purified products, and the related protein chip approach that deposits the same proteins on a microscope slide for parallel analysis (Zhu et al. 2001) appear to be partly filling the same niche. A surprising number of tRNA-modifying activities have been identified using biochemical genomics, including that encoded by TRM13, which is still listed on SGD as uncharacterized but was published this month as a tRNA 2′-O-methyltransferase (Wilkinson et al. 2007). Structural genomics is a form of “reverse” biochemical genomics: SDO1, which was changed from “uncharacterized” to “verified” by SGD in March 2007, was found by structural genomics to have an OB fold (a common oligonucleotide-binding motif), suggesting interaction with RNA, consistent with its coregulation with other RNA biogenesis factors (Savchenko et al. 2005). Data published this month show that SDO1 specifically controls maturation and translational activation of ribosomes (Menne et al. 2007). Chemical genetic profiling (Giaever et al. 2002; Parsons et al. 2006), which simultaneously probes the sensitivity of all of the deletion strains to a small molecule—often a metabolic inhibitor or an environmental chemical stress—also qualifies as a form of biochemical genomics and has the advantage of inherently revealing phenotypes. While these examples highlight fundamental aspects of cell biology, which are amenable to laboratory study, it appears as if more exploration of the life of yeast in the wild would also be productive, on the basis of the fact that half of the fungal-specific genes remain uncharacterized. 
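The over-representation of uncharacterized genes among those with exclusively fungal homologs (177 of 405, against 1253 uncharacterized genes overall) can be checked with a hypergeometric upper-tail test. The Python sketch below is a minimal version under stated assumptions: the genome total of 6000 is the approximate figure from the abstract rather than an exact count, and the function name is ours. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def hypergeom_upper_tail(k, draws, successes, population):
    """P(X >= k) when `draws` genes are sampled without replacement from a
    population that contains `successes` genes of the class of interest."""
    upper = min(draws, successes)
    numerator = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(population, draws)


TOTAL_GENES = 6000        # assumption: the text gives only "~6000"
UNCHARACTERIZED = 1253    # SGD count quoted for March 20, 2007
FUNGAL_SPECIFIC = 405     # genes whose only homologs are in other fungi
OBSERVED = 177            # fungal-specific genes that are uncharacterized

expected = FUNGAL_SPECIFIC * UNCHARACTERIZED / TOTAL_GENES
p_enrichment = hypergeom_upper_tail(
    OBSERVED, FUNGAL_SPECIFIC, UNCHARACTERIZED, TOTAL_GENES
)
print(f"expected by chance: {expected:.1f}, observed: {OBSERVED}")
print(f"hypergeometric P(X >= {OBSERVED}) = {p_enrichment:.2e}")
```

With about 85 uncharacterized genes expected by chance, the observed 177 is roughly double the background rate. The exact tail probability shifts with the assumed genome total but stays very small for any value near 6000.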
Examination of the lifestyle and genetics of yeast in its ecological niche is still in its infancy. In nature, Saccharomyces cerevisiae appears to be one of many opportunistic microorganisms found in rotting fruit (Mortimer and Polsinelli 1999; Fleet 2003). The cells presumably are dormant much of the time. Fruit flies prey on yeast, but they and other insects are also critical to the yeast life cycle, as they are believed to be the primary vector for transport of yeast (Mortimer and Polsinelli 1999). Thus, ability to stick to insects through dormancy, without being eaten, might be a requisite for propagation of wild yeast. Chemical defenses and competitive nutrient utilization strategies are also undoubtedly essential for wild yeast; even human-made fermentations contain a wide variety of interacting microbes, including yeasts, other fungi, and bacteria (Fleet 2003). Indeed, in addition to the biosynthetic enzymes and transporters listed above, the uncharacterized genes include nine zinc-cluster transcription factors [most characterized members of this class of proteins regulate metabolic pathways or responses to environmental stress (MacPherson et al. 2006)] and a smattering of other proteins that appear likely to be involved in environmental adaptation. YNL234W, for example, is a direct target of Rgt1 (a glucose-responsive transcription factor) and encodes a functional globin (Sartori et al. 1999; Kaniak et al. 2004); however, its physiological purpose remains elusive. Other uncharacterized proteins may be related to the cell surface or cell shape; YOR111W, for instance, consists almost entirely of a Maf domain, named for a protein that influences septum formation in bacteria by an unknown mechanism (Butler et al. 1993). 
Some environmental behaviors, such as filamentous growth (Kron 1997), might be better modeled in strains in addition to S288C; for instance, the filamenting strain Sigma 1278B is amenable to genetic study in the laboratory (e.g., Lorenz and Heitman 1998). Finally, it may be time to reconsider the argument that gene functions need not manifest as cut-and-dried laboratory single-mutant phenotypes to be selected over evolutionary time. Marginal but reproducible fitness contributions have previously been noted for many single mutants in nonessential yeast genes (Thatcher et al. 1998), and more recent efforts to compile synthetic genetic interactions have indicated that the full yeast genetic network may encompass as many as 200,000 pairwise interactions (Davierwala et al. 2005). The effects of loss of tRNA modifications appear to be synthetic in at least some cases (Alexandrov et al. 2006), and these modifying enzymes seem to be among the more refractory genes with respect to rapid characterization. Similar genetic behavior might be expected of other currently uncharacterized genes. At the time of the original sequencing effort, only an estimated 1000 of 6000 apparent genes had been previously described in a research article (Goffeau et al. 1996); a decade later, the running total is 4687, counting only genes that appear in focused articles (those that mention 10 or fewer yeast genes) (Fisk et al. 2006). By any standard, this is a phenomenal accomplishment, giving us confidence that the full gene complement can be characterized and optimism that similar accomplishments will be made during our lifetimes in other organisms, perhaps even mammals. Current technology appears to be sufficient to characterize most gene products at a cellular level, provided they are active and nonredundant under the conditions examined. It is possible to generate hypotheses regarding functions and associate them with confidence levels, but someone still has to follow them up. 
There is clearly a need for human inference and domain knowledge in the creation of new approaches for specific problems, including the characterization of individual genes and their roles in nature.

Acknowledgments

We are grateful to Mike Cherry, Dianna Fisk, Owen Ryan, Charlie Boone, Brenda Andrews, Guri Giaever, Allan Spradling, Mark Johnston, and Linda Bisson for helpful discussions and feedback on the manuscript.