![]() | ![]() |
Formats:
|
||||||||||||
Copyright © 2006, EMBO and Nature Publishing Group Global analysis of gene function in yeast by quantitative phenotypic profiling 1 Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA, USA 2 Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA 3 Lewis-Sigler Institute for Integrative Genomics, Department of Computer Science, Princeton University, Princeton, NJ, USA a Department of Radiation Oncology, Stanford University School of Medicine, Stanford, 269 Campus Drive West, CCSR So. Room 1255, Stanford, CA 94305-5152, USA. Tel.: +1 650 723 5881; Fax: +1 650 723 7382; E-mail: mbrown/at/stanford.edu Received June 9, 2005; Accepted December 1, 2005. This article has been cited by other articles in PMC.Abstract We present a method for the global analysis of the function of genes in budding yeast based on hierarchical clustering of the quantitative sensitivity profiles of the 4756 strains with individual homozygous deletion of nonessential genes to a broad range of cytotoxic or cytostatic agents. This method is superior to other global methods of identifying the function of genes involved in the various DNA repair and damage checkpoint pathways as well as other interrogated functions. Analysis of the phenotypic profiles of the 51 diverse treatments places a total of 860 genes of unknown function in clusters with genes of known function. We demonstrate that this can not only identify the function of unknown genes but can also suggest the mechanism of action of the agents used. This method will be useful when used alone and in conjunction with other global approaches to identify gene function in yeast. Keywords: deletion pool, functional genomics, hierarchical clustering, phenotypic profiling, yeast Introduction A major challenge facing biologists today is the assignment of function to the novel genes identified during the sequencing phase of the human genome project. A useful resource for this task in the baker's yeast, Saccharomyces cerevisiae, has been the creation of a set of homozygous deletions of all nonessential genes, with each gene replaced by a cassette containing a 20-mer molecular ‘barcode' unique to each deletion mutant (Giaever et al, 2002). This set of deletion mutants has been used by a number of investigators to identify genes involved in response to DNA-damaging agents and in other processes (reviewed in Scherens and Goffeau, 2004). In most cases, investigators have tested the deletion strains individually rather than by hybridizing the amplified barcodes from a pool of all mutants to a high-density oligonucleotide array, which allows the relative abundance of all the strains in a pool of all deletion mutants to be determined. The hybridization method has the advantage that it allows each deletion strain to be rapidly ranked on a continuum for sensitivity or resistance to the environmental change rather than in discrete bins, such as sensitive, refractory or neutral, where the boundaries are subjective. We and others have shown that hybridization of the amplified DNA barcodes is a highly reproducible method of identifying genes responsible for resistance to DNA damage (Birrell et al, 2001; Wu et al, 2004; Lee et al, 2005). In the present study, we have explored further the use of quantitative phenotypic profiling of the 4756 viable yeast deletion mutants in response to a variety of agents, to identify gene function (or the ‘biological process' in gene ontology (GO) terms). We show that, at least for the processes interrogated, this is a powerful method and appears superior to other genome-wide methods of identifying gene function, including protein–protein interactions, gene expression profiling and synthetic lethality. Results and discussion Generation of phenotypic profiles for nonessential gene deletions We obtained phenotypic profiles of the pool of homozygous diploid deletions of all nonessential genes for a total of 51 diverse stresses, including some that we reanalyzed from publicly available databases (Table I). Each of these profiles provides a quantitative distribution of the sensitivity or resistance for an individual gene deletion. The final complete data set is provided in the Supplementary Information I, and all of the raw cel files generated by us are available on the supporting website (http://microarray-pubs.stanford.edu/phenotypic_profiling/).
Hierarchical clustering of the phenotypic profiles identifies treatments by mechanism of action To assess the degree of similarity of the phenotypic profiles of individual deletion strains, and therefore the likely gene products in various functional pathways, we clustered all the data without filtering out any of the phenotypically neutral deletions. Figure 1
Hierarchical clustering identifies gene function and compares favorably with other methods in identifying the genes in the DNA-damage response pathway The expanded portion of Figure 1 We analyzed the other three global methods for their ability to group the five genes involved in the DNA-damage checkpoint and NER (Table II). Protein–protein interactions identify only three members, Rad17p, Ddc1p and Mec3p, of the checkpoint group, and, in addition, identify 105 other proteins, most, if not all, of which are likely to be false positives. The data on synthetic lethality for these five genes also identify a number of genes that are likely not involved in the DNA-damage checkpoint, although hierarchical clustering of the profiles of synthetic lethality clusters four of the genes (all except MEC3) (Tong et al, 2004). Only four genes are synthetically lethal to all the DNA-damage checkpoint clusters of genes, of which the remaining unique lethalities argue incorrectly that they do not function in the same pathway. Table II also shows that expression profiling of yeast with highly similar sorts of treatments cannot cluster this group of genes, as none of the genes are coordinately regulated in response to DNA damage (Gasch et al, 2001) or to stress (Gasch et al, 2000). The nonessential genes involved in NER represent another well-studied pathway that is successfully clustered by phenotypic profiles as shown in Figure 1
Recently, efforts to combine the data sets and filter out the inherent noise have improved the ability to predict functional clusters of genes (Troyanskaya et al, 2003; Lee et al, 2004). As shown in Table II, combining the data sets filtered for quality is an improvement over any of the individual methods. The fact that phenotypic profiling is as good if not better than all of the other methods combined for at least these two functional groups demonstrates the interrogative power of the methodology. Analysis of gene clusters by GO A critical test of the value of phenotypic profiling is to identify the function of genes of previously unknown function. To determine the feasibility of this on a genome-wide scale, we first performed a rigorous statistical analysis (see Materials and methods) to divide the hierarchical cluster into subclusters of genes, such that the correlations by which the members of a subcluster were joined are significant. Using a false discovery rate (FDR) of 10%, we found 630 nonoverlapping subclusters, containing 3084 of the original 4281 genes. Some 860 and 1151 genes in these 630 subclusters are of unknown biological process or molecular function, respectively (not counting ‘dubious' ORFs). With the remaining 1197 unessential genes not currently assigned to a cluster at this cutoff, we have failed to elicit a significant shared phenotype for functionally related genes, suggesting that testing of more conditions designed to probe other cellular functions would cluster more functionally related genes. Next, we used GO (Ashburner et al, 2000), a set of three structured, controlled vocabularies that define the biological processes, molecular functions and cellular components of gene products, in conjunction with GO annotations for yeast gene products curated by the Saccharomyces Genome Database (http://www.yeastgenome.org/GOContents.shtml). Using these, we determined whether GO annotations were significantly enriched among the genes within each significant subcluster using GO::TermFinder (Boyle et al, 2004; http://search.cpan.org/dist/GO-TermFinder/), which when given a list of genes, determines whether any of the GO terms are significantly enriched as compared to the background of GO annotations in the population of all genes. A caveat to this is that as a large proportion of the GO terms are based on mutant phenotypes, the process of relating phenotypes to GO terms is somewhat circular. However, there is at present no alternative gold standard against which to test functional clusters. Of the 630 subclusters generated, 84 showed significant associations (with a Bonferroni corrected P-value to allow for multiple hypothesis testing of less than 0.01) with one or more biological processes, 51 with molecular functions and 61 with cellular components. The lack of significance of the majority of the subclusters with GO terms does not mean that the genes in these clusters are not functionally related, because (i) we have tested only 51 conditions, and therefore have not interrogated all the biological processes in the cell, and (ii) 176 of the subclusters had only two genes, thereby precluding statistical significance. We have developed a web-based tool for rapidly browsing the results of our analyses that displays the GO structure and the phenotypic profiling data for our significant subclusters (http://microarray-pubs.stanford.edu/phenotypic_profiling/index.shtml). This should prove useful for other researchers to find genes clustered with their genes, or process of interest and its utility will grow as new phenotypic profiles interrogating additional cellular processes are added. Comparison of phenotypic profiles as indicators of functional relationships with other genome-wide approaches In order to compare the present phenotypic profiling method with the other genome-wide data sets, we evaluated the enrichment of known functional relationships between pairs of genes with highly correlated phenotypic profiles using the biological process GO as a gold standard (Ashburner et al, 2000). To test the predictive power of our data for biological processes that were directly targeted with our selection of 51 conditions and agents, we limited the gold standard for comparison to the GO terms: DNA repair (GO:0006281), amino-acid biosynthesis (GO:0008652), cell-cycle checkpoint (GO:0000075), response to osmotic stress (GO:0006970), aerobic respiration (GO:0009060) and galactose metabolism (GO:0006012).
Figure 2A
Since the expression correlation used for comparison was computed over a set of 11 diverse data sets, one might expect that functional signal for the processes evaluated here might be obscured by the variety of other relationships present in the data. To test this hypothesis, we applied two different approaches to supervised feature selection on the 11 microarray data sets. First, we used a rank-sum statistic to test each microarray experiment (column) individually and selected the most functionally relevant set for the six GO terms of interest (Figure 2A We have also studied the enrichment of gene relationships in highly correlated phenotypic profiles across a broader range of biological processes. Figure 2B and C Identifying the function of unknown genes coclustered with known genes To test the hypothesis that an uncharacterized gene would function in the same pathway as the other genes in that cluster, we chose one of the subclusters identified by the GO analysis that included an ORF of unknown function. We chose the RIM subcluster (Supplementary Figure S1, Supplementary Information II), which contains many of the known proteins involved in the RIM101 pathway, so named for their ability to ‘regulate
IM
E2', a transcription factor involved in sporulation (von Mering et al, 2002). One gene, YGR122W, in the RIM cluster is currently not annotated for function, but would be predicted to be a member of the RIM101 pathway by our analysis. We performed a series of epistatic studies that not only showed that this ORF encoded a gene involved in sporulation, but also that it was in the same subpathway as the genes in the same phenotypic cluster (Figure S2, Supplementary Information II). This illustrates that clustering of phenotypic profiles can group genes in functional pathways not specifically interrogated by the agents or conditions used. Materials and methods Yeast strains Genotypes of the parental yeast strain BY4743, construction of the homozygous diploid deletion strains and construction of the homozygous diploid deletion pool have been described previously (Winzeler et al, 1999; Giaever et al, 2002). We use a mutant pool of 4756 strains containing nonessential homozygous diploid deletions in the parental diploid strain BY4743. Construction of double mutants and testing of sporulation efficiencies was performed according to standard procedures (Guthrie and Fink, 1991). Treatment assays, probe production and chip hybridization Deletion pool aliquots are resuspended in YPD media and treated as described earlier (Wu et al, 2004). Briefly, equivalent numbers of cells (6 × 106) are treated or mock treated for 1–16 h. For the acute treatments of less than 16 h, a fraction of the cell resuspension is washed and added to 60 ml YPD and grown for the indicated time at 30°C. After the regrowth period, the cells are harvested, the genomic DNA isolated, PCR amplification of the ‘molecular barcodes' performed, hybridized to custom-made Affymetrix oligonucleotide arrays and the chip scanned as described previously (Winzeler et al, 1999). Postscanning analysis to maximize data quality Each deletion strain is associated with four hybridization signals on the high-density oligonucleotide array generated in two separate PCR labeling reactions: UPTAG (sense and antisense) and DNTAG (sense and antisense). Equal numbers of cells are harvested in both the control and treated pools in order to produce equal pool label intensities. We normalize the data generated in the experimental array to that of the control array in order to eliminate any bias created during the PCR amplification reaction. In brief, we calculated separate UPTAG and DNTAG normalization factors, such that the total signal intensities of the UPTAGs and DNTAGs are equal in the control and experimental arrays. We calculate the background intensity of each array in order to identify those tags that fail to generate a signal sufficiently above the background level in the controls to produce a meaningful ratio as described earlier (Wu et al, 2004). For inclusion of strains in subsequent analyses, all three of the following criteria must hold:
Application of these quality control criteria eliminated 9% of all strains in the pool (i.e. we analyzed 4281 of the total 4756 deletion strains). Data taken from previously published works (Birrell et al, 2001; Giaever et al, 2002; Game et al, 2003; Haugen et al, 2004; Wu et al, 2004) have been reanalyzed with the current criteria using the original cel files to standardize the data for inclusion in the hierarcial clustering. An average control cel file was generated for the analysis of the published data, which utilized a common historic control for pools grown after five and 15 generations from the original cel files (Giaever et al, 2002). Hierarchical clustering For the unsupervised complete linkage analysis, we used the Pearson correlation coefficient method of similarity measurement resident in the HCE 2.0 software available from the University of Maryland Human–Computer Interaction Lab (http://www.cs.umd.edu/hcil/hce/), without normalization, uncentered, two-way complete linkage. Calculation of the number of significant subclusters As we are performing multiple tests (one per each joining correlation in the hierarchical cluster), it is necessary to perform a multiple hypothesis correction. To do this, we used the FDR, which is the expected proportion of true null hypotheses rejected out of the total number of rejections. In our case, the null hypothesis is that the joining correlation of a cluster of genes is not significant, and the FDR is the proportion of hypotheses that we think are significant, but are not, that is, we erroneously reject the null hypothesis. To calculate an FDR for the correlation of each of the nodes in our hierarchical cluster, we permuted the phenotypic profiling data matrix within both rows and columns, and then clustered the permuted data. We repeated this process 1000 times, and the joining correlations for every node in each clustering were saved. We then processed the correlations generated for our real data, comparing them to the correlations generated by the permuted data, as follows: Let S be the correlation for a node within a cluster, higher values of S being better. For a given correlation, S
0, we computed the following:
R=the number of nodes generated from the clustered real data with a correlation S
0.
V=the number of nodes, on average, generated from the permuted data, with a correlation S
0.The FDR for a given S
0, FDR(S
0), is thus given by V/R, and indicates the fraction of nodes with a correlation at or better than S
0, which would be expected to be false positives. Setting the FDR at 10%, we generated 630 nonoverlapping subclusters, containing 3084 of the original 4281 genes. GO functional relationship evaluation and comparison with other genome-wide data Each type of genomic data used for comparison is described in detail below. Affinity precipitation, synthetic lethality and yeast two hybrid Affinity precipitation, synthetic lethality and yeast two-hybrid data were obtained by merging data from the GRID and BIND databases (Breitkreutz et al, 2003; Alfarano et al, 2005). Transcription factor binding sites Transcription factor binding site data from SCPD (Zhu and Zhang, 1999) and TRANSFAC (Wingender et al, 2000) were downloaded from (http://seq.cbrc.jp/~wataru/PROSPECT/). Pairs of genes sharing a transcription factor were evaluated against the GO standard as described below. Microarray correlation We collected microarray data sets from the Stanford Microarray Database (Ball et al, 2005). Our collection includes 11 different studies, totaling 30 distinct biological conditions (see Supplementary Information II). Pearson's correlation coefficients for each gene pair were computed on each set of biological conditions separately and converted to standard normal z-scores. For example, for gene pair i–j in condition set k,
where ρ
ij
k
is the Pearson correlation coefficient computed over the kth set of conditions,
k
is the average correlation over all pairs for those conditions and σρ
k
is the standard deviation over all pairs for those conditions. The final combined score for each gene pair was then computed by summing the z-scores for all sets of conditions, that is,
Microarray feature selection Two supervised feature selection approaches were used on the diverse set of microarray data. For the individual experiment (column) feature selection approach, we used a rank-sum statistic to test the null hypothesis that the expression values of a group of coannotated genes (for any of the six GO terms of interest) was no different than a random set of genes of the same size drawn from the same array. Only experiments for which this hypothesis was rejected were used in computing the final correlation. For the data set feature selection approach (i.e. where sets of experiments from the same data set were either all included or excluded), we computed correlation coefficients between all pairs of related genes for the six GO terms on each data set separately. Then, we tested the null hypothesis that this set of functional correlations was no different from correlations between random pairs of genes in the same data set. For both approaches, we used ⅓ of the related genes from each of the six categories to select features, and evaluated the remaining ⅔. GO gold standard for evaluation To compare the power of phenotypic profiles for predicting functional relationships with other genome-wide approaches, we obtained sets of gold standard-positive and -negative gene pairs from the Saccharomyces Genome Database annotation of the biological process GO (Ashburner et al, 2000). To test the predictive power of our data for biological processes related to the biological functions that were directly targeted with our selection of 51 conditions and agents, we limited the gold standard to the GO terms DNA repair (GO:0006281), amino-acid biosynthesis (GO:0008652), cell cycle checkpoint (GO:0000075), response to osmotic stress (GO:0006970), aerobic respiration (GO:0009060), and galactose metabolism (GO:0006012). Propagating all annotations from children of these nodes upward to the nodes themselves yields a total of approximately 21 000 positive pairs. To obtain gold standard negatives, we first propagated each biological process annotation up to its parents and counted the number of total direct and indirect annotations per GO term. As the biological specificity of each term roughly corresponds to the number of total annotations, gene pairs whose most specific coannotation occurs in nodes with 1000 total annotations or more were considered negatives. Furthermore, we limited the set of negatives to pairs with at least one gene annotated to one of the six specific nodes above (or their children), resulting in a total of 1.3 million gold standard-negative pairs. For the comparison over a broader range of biological processes, we used the coannotation term size approach described above. Specifically, gold standard positives were all pairs of genes with coannotations in GO terms with less than 300 total direct and indirect annotations. Pairs of genes whose most specific coannotation occurs in nodes with 1000 total annotations or more were considered gold standard negatives, resulting in a total of approximately 500 000 positive pairs and 6 million negative pairs. Constructing the precision–recall curve Precision and recall for each genomic data type were calculated as described below.
True positives (TP): Gene pairs associated by data (high correlation) and annotated as positives in GO standard.
False positives (FP): Gene pairs associated by data (high correlation) and annotated as negatives in GO standard.
For continuous-valued data (e.g. correlation of phenotype profiles and microarray correlation), the precision and recall calculations above are computed for a range of thresholds. Each threshold yielded one point on the precision–recall curve by considering gene pairs whose correlation exceeds the threshold value as positive predictions and other pairs as negative. Supplementary Information I Click here to view.(1.4M, txt) Supplementary Information II Click here to view.(564K, doc) Acknowledgments This work was supported by a National Cancer Institute Grant P01 CA67166 (JMB), by a training Grant CA09302 (to JAB) from the National Cancer Institute, by Grant R01 HG003469 from the National Human Genome Research Institute (to GS), by a Quantitative and Computational Biology Program Grant T32 HG003284-01(CLM), and an NIH Grant 1R01GM071966-01A1 (to OGT). References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||
Nature. 2002 Jul 25; 418(6896):387-91.
[Nature. 2002]Genome Biol. 2004; 5(7):229.
[Genome Biol. 2004]Proc Natl Acad Sci U S A. 2001 Oct 23; 98(22):12608-13.
[Proc Natl Acad Sci U S A. 2001]Cancer Res. 2004 Jun 1; 64(11):3940-8.
[Cancer Res. 2004]PLoS Genet. 2005 Aug; 1(2):e24.
[PLoS Genet. 2005]Mutat Res. 2005 Apr 1; 571(1-2):3-17.
[Mutat Res. 2005]Cancer Res. 2002 Sep 15; 62(18):5248-53.
[Cancer Res. 2002]Nature. 2000 Nov 23; 408(6811):433-9.
[Nature. 2000]Genome Biol. 2003; 4(7):R45.
[Genome Biol. 2003]Science. 2004 Feb 6; 303(5659):808-13.
[Science. 2004]Mol Biol Cell. 2001 Oct; 12(10):2987-3003.
[Mol Biol Cell. 2001]Mol Biol Cell. 2000 Dec; 11(12):4241-57.
[Mol Biol Cell. 2000]Proc Natl Acad Sci U S A. 2003 Jul 8; 100(14):8348-53.
[Proc Natl Acad Sci U S A. 2003]Science. 2004 Nov 26; 306(5701):1555-8.
[Science. 2004]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]Bioinformatics. 2004 Dec 12; 20(18):3710-5.
[Bioinformatics. 2004]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]Cell. 2000 Jul 7; 102(1):109-26.
[Cell. 2000]Nature. 2002 May 23; 417(6887):399-403.
[Nature. 2002]Science. 1999 Aug 6; 285(5429):901-6.
[Science. 1999]Nature. 2002 Jul 25; 418(6896):387-91.
[Nature. 2002]Cancer Res. 2004 Jun 1; 64(11):3940-8.
[Cancer Res. 2004]Science. 1999 Aug 6; 285(5429):901-6.
[Science. 1999]Cancer Res. 2004 Jun 1; 64(11):3940-8.
[Cancer Res. 2004]Proc Natl Acad Sci U S A. 2001 Oct 23; 98(22):12608-13.
[Proc Natl Acad Sci U S A. 2001]Nature. 2002 Jul 25; 418(6896):387-91.
[Nature. 2002]Radiat Res. 2003 Jul; 160(1):14-24.
[Radiat Res. 2003]Genome Biol. 2004; 5(12):R95.
[Genome Biol. 2004]Cancer Res. 2004 Jun 1; 64(11):3940-8.
[Cancer Res. 2004]Genome Biol. 2003; 4(3):R23.
[Genome Biol. 2003]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D418-24.
[Nucleic Acids Res. 2005]Bioinformatics. 1999 Jul-Aug; 15(7-8):607-11.
[Bioinformatics. 1999]Nucleic Acids Res. 2000 Jan 1; 28(1):316-9.
[Nucleic Acids Res. 2000]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D580-2.
[Nucleic Acids Res. 2005]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]Cell. 2000 Jul 7; 102(1):109-26.
[Cell. 2000]Nature. 2002 Jul 25; 418(6896):387-91.
[Nature. 2002]Proc Natl Acad Sci U S A. 2001 Oct 23; 98(22):12608-13.
[Proc Natl Acad Sci U S A. 2001]Radiat Res. 2003 Jul; 160(1):14-24.
[Radiat Res. 2003]Cancer Res. 2004 Jun 1; 64(11):3940-8.
[Cancer Res. 2004]Genome Biol. 2004; 5(12):R95.
[Genome Biol. 2004]Mol Biol Cell. 2001 Oct; 12(10):2987-3003.
[Mol Biol Cell. 2001]Mol Biol Cell. 2000 Dec; 11(12):4241-57.
[Mol Biol Cell. 2000]Science. 2004 Nov 26; 306(5701):1555-8.
[Science. 2004]Science. 2004 Feb 6; 303(5659):808-13.
[Science. 2004]