Copyright © 2009 The Author(s)

Text-based over-representation analysis of microarray gene lists with annotation bias

Department of Pathology, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK

*To whom correspondence should be addressed. Tel: +44 (29) 206 87037; Email: kiplingd/at/cardiff.ac.uk

Received December 1, 2008; Revised April 14, 2009; Accepted April 16, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

A major challenge in microarray data analysis is the functional interpretation of gene lists. A common approach to address this is over-representation analysis (ORA), which uses the hypergeometric test (or its variants) to evaluate whether a particular functionally defined group of genes is represented more than expected by chance within a gene list. Existing applications of ORA have been largely limited to pre-defined terminologies such as GO and KEGG. We report our explorations of whether ORA can be applied to a wider mining of free-text. We found that a hitherto underappreciated feature of experimentally derived gene lists is that the constituents have substantially more annotation associated with them, as they have been researched upon for a longer period of time. This bias, a result of patterns of research activity within the biomedical community, is a major problem for classical hypergeometric test-based ORA approaches, which cannot account for such bias. We have therefore developed three approaches to overcome this bias, and demonstrate their usability in a wide range of published datasets covering different species.
A comparison with existing tools that use GO terms suggests that mining PubMed abstracts can reveal additional biological insight that may not be possible by mining pre-defined ontologies alone. INTRODUCTION The output of a microarray experiment is typically one or more lists of genes that show an ‘interesting’ change in expression in the context of that experiment. This is often not the end point of the analysis, but the starting point of a complex process of deriving biological interpretation. Many researchers interpret their results by manually reviewing the function of each gene based on literature or database searches, or by prior familiarity with the gene and a plausible link to the biology under study. This ad hoc annotation process is both time-consuming and prone to user bias. The need to formalise this interpretation process has led to the development of a range of tools, of which a family of statistical methods collectively known as over-representation analysis (ORA) is becoming increasingly popular among researchers undertaking microarray analysis. The fundamental question asked by ORA is: what biological terms or functional categories are represented in the gene list more often than expected by chance. The most common approach to test this statistically is by using the hypergeometric test (or its variants such as Fisher's exact test) to calculate the probability of seeing at least a particular number of genes containing the biological term of interest in the gene list. This mode of analysis has been implemented (with minor variations) in several publicly available software tools, including DAVID/EASEonline (1), FatiGO (2), GenMAPP (3), GoMiner (4) and OntoTools (5). Currently, the applications of ORA are largely limited to the mining of pre-defined ontologies (e.g. GO, MeSH) or pathway annotation (e.g. KEGG, BioCarta). 
These resources are, to a large extent, generated from manual literature reading by experts, with the aim of providing a structured, condensed and reduced description of the biological knowledge about genes in the scientific literature. However, due to its labour-intensive nature, such pre-defined functional annotations are inevitably limited in scope and flexibility, and cannot fully reflect the detail of all areas of biology that might be of interest. A much greater wealth of biological knowledge about genes is present only in the primary, text-based biomedical literature, which is readily accessed in the form of abstracts, and increasingly as full-text articles from selected biomedical journals. We were therefore interested to determine whether the successful applications of ORA can be extended beyond the mining of controlled vocabularies to a wider mining of free-text, initially in the form of PubMed abstracts. Our initial exploration into this approach was based on a simple tokenisation of PubMed abstracts, followed by the identification of over-represented tokens using the classical hypergeometric test. When this approach was tested on 52 literature-derived gene lists, we discovered a dramatic and hitherto underappreciated feature—gene lists derived from a typical microarray experiment tend to have more annotation (i.e. PubMed abstracts) associated with them than would be expected by chance. This bias can lead to a marked over-representation of many common (and likely uninformative) terms, interspersed with terms that appear to convey real biological insight. We have developed several solutions to this issue. The first is based on the use of a permutation test, but is hampered by being computationally intensive. Therefore two computationally tractable approaches for performing ORA mining on PubMed abstracts, based on the detection of outliers and the extended hypergeometric distribution, were developed. 
Here, we describe the unique features of these methods and illustrate their utility by applying them to several diverse biological datasets. MATERIALS AND METHODS Public datasets We used publicly available microarray datasets to evaluate the performance of the ORA methods described in this work. In total, 354 different gene lists were collected from 146 scientific papers, which cover experiments performed on 10 major Affymetrix platforms, including HG-U133A, HG-U133 Plus 2.0, Mouse 430 2.0, Rat 230 2.0, Arabidopsis ATH1, DrosGenome1, Drosophila 2.0, Xenopus laevis, C. elegans and Zebrafish. These gene lists are collectively referred to as the ‘literature gene lists’ and their details can be found in Supplementary Data 2. Two gene lists were selected to evaluate in more detail the performance of the ORA methods presented here:
Text corpus creation The methods described here require the initial creation of a text corpus that connects the textual information stored in PubMed abstracts with genes included in the microarray analysis. First, we mapped all the genes represented on an array to their corresponding EntrezGene identifiers (EGID) based on the mapping schemes provided by the appropriate Bioconductor metadata packages. Then, PubMed articles associated with these genes were obtained from the gene2pubmed file downloaded from NCBI (ftp://ftp.ncbi.nih.gov/gene/DATA; time stamp: 25 October 2007) in the form of EGID to PubMed identifier (PMID) mappings. PMIDs that are associated with more than one EGID were omitted because, based on manual inspection, these tend to be large-scale sequencing reports that contain information largely irrelevant to gene function. This lack of specificity can affect the performance of the text mining algorithms. PubMed articles passing this criterion were retrieved from the PubMed database using a customised Perl script implementing modules from the Entrez Programming Utilities (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html). Upon retrieval, the abstracts were tokenised on white space to produce single-word terms. Any redundancies were removed to produce a unique set of tokens for each gene. Terms composed exclusively of numbers were removed from the text corpus. Then, a simple stemming operation was applied to reduce plural to singular forms (e.g. ‘kinases’ becomes ‘kinase’). Verb tenses were stemmed to their root (e.g. ‘phosphorylates’ and ‘phosphorylated’ become ‘phosphorylate’). Other more elaborate analysis of spelling variants (e.g. catalyze, catalyse) and composite words (e.g. cell cycle, DNA polymerase) were not explored. Porter's stemming algorithm (http://tartarus.org/~martin/PorterStemmer/; Perl version, release 1) was adapted for this analysis. 
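The token-generation steps described above (white-space tokenisation, removal of redundancies, removal of purely numeric terms, stemming) can be sketched in a few lines of Python. This is a minimal illustration only: `toy_stem` is a hypothetical stand-in that handles just the plural and verb-tense examples given in the text and is not Porter's algorithm, and the trimming of surrounding punctuation is an assumption that the text does not state.

```python
def toy_stem(token):
    """Toy suffix stripper (NOT Porter's algorithm): covers only the
    plural and verb-tense examples mentioned in the text."""
    if token.endswith("ed") and len(token) > 4:
        return token[:-1]            # 'phosphorylated' -> 'phosphorylate'
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]            # 'kinases' -> 'kinase'
    return token


def tokens_for_gene(abstracts):
    """Return the unique, stemmed, non-numeric tokens for one gene.

    abstracts: list of abstract texts linked to the gene.
    """
    tokens = set()
    for text in abstracts:
        for raw in text.split():                  # tokenise on white space
            word = raw.strip(".,;:()[]\"'")       # assumption: trim punctuation
            if not word or word.isdigit():        # drop purely numeric terms
                continue
            tokens.add(toy_stem(word))            # the set removes redundancies
    return tokens


example = tokens_for_gene([
    "Two kinases phosphorylated STAT1 in 2006.",
    "The kinase phosphorylates STAT1",
])
print(sorted(example))
```

Both example abstracts collapse onto the same `kinase` and `phosphorylate` tokens, and the year is discarded as a purely numeric term.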
Definitions of Chip and List frequencies

For each token associated with a given gene list, we calculated two values. The first, called Chip frequency, is defined as the number of genes on the entire chip (i.e. the background) that contain the token of interest. The second, called List frequency, is the number of genes in the query gene list that are associated with the token of interest.

Classical hypergeometric distribution-based ORA approach

Suppose that the total number of genes in the background population is N, of which M are associated with a certain token of interest T. If we select K genes randomly from the entire microarray without replacement, the probability of seeing exactly x genes associated with T among the K genes can be modelled by the hypergeometric distribution (9). Hence, the probability of seeing x or more genes containing token T in a random gene list of K genes can be calculated as the cumulative probability:

$$P(X \ge x) = \sum_{i=x}^{\min(K,M)} \frac{\binom{M}{i}\binom{N-M}{K-i}}{\binom{N}{K}}$$
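This tail probability, the sum of the hypergeometric point probabilities for x, x + 1, ..., min(K, M) genes, can be computed directly from binomial coefficients. The following is a minimal Python sketch using exact integer arithmetic; the function name and the numbers in the usage example are illustrative assumptions and do not come from the analysis itself.

```python
from math import comb


def hypergeom_upper_tail(x, N, M, K):
    """P(X >= x) when K genes are drawn without replacement from N genes,
    M of which carry the token of interest.

    N: Chip size (background genes)
    M: Chip frequency of the token
    K: size of the query gene list
    x: List frequency of the token
    """
    upper = min(K, M)
    # comb() returns 0 when the second argument exceeds the first, so
    # impossible configurations contribute nothing to the sum.
    numerator = sum(comb(M, i) * comb(N - M, K - i)
                    for i in range(max(x, 0), upper + 1))
    return numerator / comb(N, K)


# Illustrative numbers only: 10 genes on the chip, 4 carry the token,
# and a 3-gene list contains the token in 2 genes.
print(hypergeom_upper_tail(2, N=10, M=4, K=3))
```

In an ORA setting this value is computed once per token and then corrected for multiple testing, for example with the Bonferroni correction used in the Results.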
Permutation test

The fundamental idea underlying this permutation approach is to create a random gene list that matches the experimentally derived gene list not only in the number of genes but also in the number of associated PMIDs. This is achieved by replacing the abstracts for each gene in the experimentally derived gene list with other abstracts selected randomly (without replacement) from the text corpus. As such, the number of genes and abstracts (hence PMIDs) for each random gene list is kept the same as that of the experimentally derived gene list. This permutation procedure is repeated n = 100 000 times. An empirical P-value of over-representation can then be calculated for each token in the gene list as the fraction of times its frequency in the random gene list (r) is equal to or greater than that seen in the experimentally defined gene list (x):

$$P = \frac{\#\{r \ge x\}}{n}$$
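The procedure can be sketched as follows: a minimal Python illustration under stated assumptions, not the original R/Perl implementation. It assumes that the corpus is held as a list of per-abstract token sets and that each gene in the list keeps its original number of abstracts; the function and parameter names are hypothetical, and the default number of permutations is reduced so that the example runs quickly.

```python
import random


def permutation_p_value(token, observed_freq, abstracts_per_gene, corpus,
                        n_perm=1000, seed=0):
    """Empirical P-value: fraction of random gene lists in which the token
    occurs in at least `observed_freq` genes.

    abstracts_per_gene: number of abstracts for each gene in the real list
    corpus: list of token sets, one per abstract (assumed layout)
    """
    rng = random.Random(seed)
    total = sum(abstracts_per_gene)
    hits = 0
    for _ in range(n_perm):
        # Draw replacement abstracts without replacement from the corpus.
        drawn = rng.sample(corpus, total)
        r, start = 0, 0
        for count in abstracts_per_gene:
            gene_abstracts = drawn[start:start + count]
            start += count
            # A gene counts once if any of its abstracts contains the token.
            if any(token in abstract for abstract in gene_abstracts):
                r += 1
        if r >= observed_freq:
            hits += 1
    return hits / n_perm


# Toy corpus: every abstract contains 'the'; none contains 'interferon'.
corpus = [{"the", "cell"} for _ in range(10)]
print(permutation_p_value("the", 2, [2, 3], corpus))          # common word
print(permutation_p_value("interferon", 1, [2, 3], corpus))   # absent token
```

Because every random list is rebuilt from freshly sampled abstracts, the cost grows with both the number of permutations and the number of tokens tested, which is consistent with the long run time reported for this method.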
Outlier: Outlier detection-based ORA

This method is motivated by the observation that, on a scatter plot of Chip versus List frequencies, there is a set of biologically plausible terms that deviate substantially from the main data cluster and appear as outliers (see Figure 2).

We developed an outlier detection procedure that determines, within a group of tokens corresponding to the same List frequency, any tokens that have lower than expected Chip frequencies. Formally, we derive a Z-score for each token based on its Chip frequency and infer the statistical significance of this Z-score against the normal distribution as follows.
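The sketch below illustrates one plausible reading of this procedure and should not be taken as the published formula. It assumes that the Z-score of a token is its Chip frequency standardised by the sample mean and standard deviation of the Chip frequencies of all tokens sharing its List frequency, and that significance is the lower-tail probability under the standard normal distribution. Any transformation or robust estimator that the actual method may use is not reflected here, and all names and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import NormalDist, mean, stdev


def outlier_scores(freqs):
    """Return {token: (z, p)} for tokens with unusually LOW Chip frequency.

    freqs: {token: (chip_frequency, list_frequency)}
    Groups with fewer than two tokens, or with zero variance, are skipped.
    """
    groups = defaultdict(list)
    for token, (chip, lst) in freqs.items():
        groups[lst].append((token, chip))     # group by List frequency

    std_normal = NormalDist()                 # mean 0, standard deviation 1
    scores = {}
    for members in groups.values():
        chips = [chip for _, chip in members]
        if len(chips) < 2:
            continue
        mu, sd = mean(chips), stdev(chips)
        if sd == 0:
            continue
        for token, chip in members:
            z = (chip - mu) / sd
            scores[token] = (z, std_normal.cdf(z))   # lower-tail probability
    return scores


# Made-up frequencies: five tokens found in 5 list genes each.
toy = {
    "interferon": (10, 5), "after": (1000, 5), "also": (1100, 5),
    "each": (900, 5), "cell": (1000, 5), "rare": (3, 1),
}
for token, (z, p) in sorted(outlier_scores(toy).items(), key=lambda kv: kv[1][1]):
    print(f"{token:12s} z={z:6.2f} p={p:.4f}")
```

A token such as `interferon`, which is rare on the chip but as frequent in the list as common words, receives a strongly negative Z-score, matching the outlier pattern described for the scatter plot.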
ExtendedHG: extended hypergeometric distribution-based ORA

The extended hypergeometric distribution, also known as the Fisher non-central hypergeometric distribution, is a generalization of the classical hypergeometric distribution in which the sampling procedure is biased (9–11). Assume that we draw n balls without replacement from an urn containing N balls, of which m1 are red and m2 are white. The balls have different weights, where the weight of each red and white ball is w1 and w2, respectively. When sampling is unbiased (i.e. w1 = w2), the balls have equal probability of being taken (i.e. p1 = p2) and the results will follow the classical hypergeometric distribution. However, if sampling is biased such that the probability of taking a ball of one colour is proportional to its weight but independent of the other balls, then the number of balls of a particular colour drawn will follow the binomial distribution:

$$P(X_i = x_i) = \binom{m_i}{x_i}\, p_i^{x_i} (1 - p_i)^{m_i - x_i}, \qquad i = 1, 2$$

On the condition that the sum of the independent binomial variables is fixed (i.e. ∑ xi = n), the number of red balls in our sample x will follow the extended hypergeometric distribution and the probability of seeing x red balls simply by chance is given by

$$P(X = x) = \frac{\binom{m_1}{x}\binom{m_2}{n-x}\,\omega^{x}}{\sum_{y=\max(0,\,n-m_2)}^{\min(n,\,m_1)} \binom{m_1}{y}\binom{m_2}{n-y}\,\omega^{y}}$$

where ω = [p1/(1 − p1)]/[p2/(1 − p2)] is the odds ratio.
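A minimal Python sketch of this probability is given below, assuming the standard Fisher non-central form in which each classical hypergeometric term C(m1, y)·C(m2, n − y) is weighted by ω^y, with ω the odds ratio. How ω is estimated for a given gene list is not covered here, so it is passed in as a parameter; the function names and example values are illustrative assumptions. The direct summation is only suitable for small inputs, since the weights can overflow floating point for realistic chip sizes.

```python
from math import comb


def extended_hypergeom_pmf(x, m1, m2, n, omega):
    """P(X = x) red balls when n balls are drawn from m1 red and m2 white
    balls under biased sampling with odds ratio omega."""
    lo, hi = max(0, n - m2), min(n, m1)        # support of the distribution
    if x < lo or x > hi:
        return 0.0
    weights = {y: comb(m1, y) * comb(m2, n - y) * omega ** y
               for y in range(lo, hi + 1)}
    return weights[x] / sum(weights.values())


def extended_hypergeom_upper_tail(x, m1, m2, n, omega):
    """P(X >= x), the analogue of the classical ORA tail probability."""
    hi = min(n, m1)
    return sum(extended_hypergeom_pmf(y, m1, m2, n, omega)
               for y in range(max(x, 0), hi + 1))


# With omega = 1 the classical hypergeometric tail is recovered;
# omega > 1 makes a given count of 'red' genes less surprising.
print(extended_hypergeom_upper_tail(2, m1=4, m2=6, n=3, omega=1.0))
print(extended_hypergeom_upper_tail(2, m1=4, m2=6, n=3, omega=2.0))
```

Setting ω above 1 for commonly annotated tokens raises their tail probability relative to the classical test, which is the mechanism by which an annotation-biased list stops flagging common words as over-represented.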
Consensus gene age computation

In order to gain an approximate measure of the ‘age’ (i.e. the length of time a gene has been known and researched) for genes on the HG-U133A array, we used a combination of PubMed, OMIM (Online Mendelian Inheritance in Man) and HGNC (HUGO Gene Nomenclature Committee) to collate the following information for each gene: (1) the date of the earliest PubMed article that described the gene, (2) the date of the earliest article cited in an OMIM record in which the gene is described, (3) the date on which the gene first appeared in an OMIM record and (4) the date on which the gene was first approved by HGNC. Based on these dates, we derived two ‘age estimates’ for each gene: literature- and database-based age. Literature-based age was calculated by averaging the values from (1) and (2), and it represents the date on which a gene was first referred to in the scientific literature. Database-based age was calculated as the mean of (3) and (4), and represents the date when a gene symbol was first approved or integrated into the public databases. There is a good correlation between the literature- and database-based ages (data not shown); therefore we calculated a final consensus age for each gene by taking the average of the literature- and database-based ages.

Platform and availability

The algorithms for Outlier and ExtendedHG have been implemented in a public web server PAKORA (http://www.pakora.cf.ac.uk/pakora.php) for interactive use. The source code (written in R and Perl) is available from the author upon request. A summary of the literature-derived gene lists that we have used in this project can be found in Supplementary Data 2. These gene lists are available for download from PAKORA. The R scripts used in this analysis were developed and tested under R-2.6 and BioConductor-2.1. The Perl scripts were developed and tested under Perl v5.8.7. All analysis was performed on a Windows PC with a 2.8 GHz processor and 2 GB of RAM.
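The consensus gene age computation described above is a mean of two means, which can be sketched as follows. This is a minimal Python illustration that assumes each of the four dates is expressed as a (possibly fractional) calendar year and that all four are available; how missing dates were handled is not stated in the text and is not modelled. The function name and example years are hypothetical.

```python
def consensus_gene_age(first_pubmed, first_omim_citation,
                       first_omim_record, hgnc_approval):
    """Average the literature-based and database-based ages of a gene.

    All arguments are calendar years (assumption), corresponding to
    items (1) to (4) in the text.
    """
    literature_age = (first_pubmed + first_omim_citation) / 2   # mean of (1), (2)
    database_age = (first_omim_record + hgnc_approval) / 2      # mean of (3), (4)
    return (literature_age + database_age) / 2


# Made-up dates for a single gene.
print(consensus_gene_age(1988, 1990, 1992, 1994))
```

A smaller value denotes an ‘older’ gene, i.e. one that entered the literature and the databases earlier.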
RESULTS We began by analysing a list of interferon-stimulated genes (ISG) derived from the biological data reported in Sanda et al. (6), using the classical hypergeometric distribution-based ORA approach. The ISG gene list was used as a testbed because it constitutes a relatively simple and well-studied example of transcriptional regulation. PubMed articles associated with genes represented on the Affymetrix HG-U133A array were collected and filtered to give a text corpus consisting of 107 517 abstracts (70% of the unfiltered collection). These abstracts were tokenised and stemmed to produce 220 290 unique single-word tokens for mining. Of these, 9486 tokens are associated with the ISG gene list. Classical hypergeometric test-based ORA is affected by annotation bias Initial use of the classical hypergeometric distribution-based method produced encouraging results when applied to the ISG gene list. In total, 94 tokens were called significantly enriched after Bonferroni correction (Table 1). Biologically plausible terms such as ‘interferon’, ‘IFN’, ‘antiviral’, ‘IFN-alpha’, ‘IFN-beta’, ‘inducible’ and ‘immune’, were amongst the most significant hits being called over-represented. These terms are related in principle to the role of interferon in modulating host immune responses against viruses and infection. However, these biologically relevant terms were interspersed with relatively uninformative terms for which it was less plausible that they were specifically associated with the biology of interferon-regulated gene expression. These include common words (such as ‘after’, ‘each’, ‘also’) and non-specific biological words (such as ‘synthesis’, ‘molecule’, ‘beta’). A similar mix of specific and non-specific tokens was generally seen for other gene lists that we have analysed (data not shown).
Some areas of biology have, historically, been subject to greater levels of research activity and this is reflected in the biological literature. We reasoned that if a particular experiment were focused on a particularly well-studied area of biology this might therefore lead to a greater number of PMIDs associated with the resultant gene list than might otherwise occur. This in turn would introduce a bias that would affect the application of the classical hypergeometric test. The interferon response is an example of a well-researched area and the 77 genes in the ISG gene list are annotated by 1514 PMIDs. However, if we were to create a 77-gene list by random sampling from the same set of background genes on the chip we would expect to see, on average, only 660 PMIDs associated with such a random gene list. Thus, in this example the ‘real world’ ISG gene list has 2.3 times more PMIDs associated with it than would be expected by chance. The consequence of this on the classical hypergeometric distribution-based ORA approach is that, for some tokens, there is a general shift towards appearing over-represented, simply because the background frequency is artefactually under-estimated. Therefore, even a relatively modest increase in token frequency of a common word would produce a significant P-value. To explore this further we collated 52 gene lists from the published literature that were based on use of the human HG-U133A array (see Supplementary Data 2 for details of these literature gene lists), and then compared the amount of PMIDs in them with that in an equivalent set of random gene lists. We found that gene lists derived by experimental means (i.e. the result of mining a real biological dataset) tend to have a greater number of associated PMIDs than equivalently sized random gene lists (Figure 1
Annotation bias and consensus gene age

A ‘well-studied’ gene may in part reflect one that has been known for many years, thus allowing a substantial corpus of literature regarding it to be accumulated. To investigate the possible effect of this aspect of the history of recent scientific research we determined the ‘age’ of each gene represented on the HG-U133A chip. Here, gene age is defined as an approximate measure of how long a gene has been known and researched relative to other genes, and should not be confused with the molecular timescale of evolution in the genome. We derived a consensus age for each gene represented on the HG-U133A chip based on two criteria: (1) when the gene was first cited in the published literature and (2) when the gene was first integrated into the OMIM and HGNC public databases. The consensus gene age was computed as the average of these two measures (see ‘Materials and Methods’ section for more details), and ranges from year 1939 to 2007. In the context of this analysis, a gene with a consensus age of 1990 implies that it was discovered in approximately 1990, so it is considered older and has been studied for longer compared to a gene with a consensus age of 1998. We stratified all the genes by their consensus age and compared the number of PMIDs associated with them. As expected, younger genes that have only recently been described have markedly fewer PMIDs associated with them, whereas older genes are generally better studied and cited by more PMIDs (Supplementary Data 1 Figure S1). This effect seen on individual genes is also translated into an effect on the mean age of genes in biologically derived gene lists. As can be seen in Figure 1

Overcoming annotation bias by permutation test

Our initial attempt to address the effects of annotation bias used a permutation test that makes no assumptions about the underlying data distribution.
The significance of a token was assessed by calculating an empirical P-value based on the creation of 100 000 random gene lists, each of which was matched for the number of genes and the number of associated PMIDs. Tokens with List frequency equal to 1 were removed before the permutation test because a token can only be useful in defining relationships among genes if it is shared by at least two of them. After this filtering, 4840 tokens remained for testing. This approach produced an improvement over the classical hypergeometric distribution-based approach when tested on the ISG gene list, insofar as it successfully retained biologically plausible terms such as ‘interferon’, ‘IFN’ and ‘antiviral’, whilst no longer calling the less-specific terms significant (Table 2). However, one limitation of this approach is the precision of the P-values, the smallest of which is determined by the number of permutations carried out. In this analysis, the best possible Bonferroni P-value attainable is 10^-5 × 4840 = 0.0484. This could be improved by increasing the number of permutations, but as with many permutation-based methods, this approach is extremely computationally intensive, requiring six hours on a standard desktop computer to analyse the ISG gene list. To address this issue we developed two computationally efficient methods capable of handling the annotation bias problem.
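The precision limit noted above follows from simple arithmetic: with n permutations the smallest attainable P-value is 1/n, and the Bonferroni correction multiplies it by the number of tokens tested. A minimal Python sketch, with a hypothetical function name, reproduces the figure quoted for the ISG gene list and shows how the floor moves with more permutations.

```python
def min_bonferroni_p(n_permutations, n_tests):
    """Smallest Bonferroni-corrected empirical P-value attainable when the
    smallest raw P-value is 1 / n_permutations."""
    return min(1.0, n_tests / n_permutations)


# ISG gene list: 4840 tokens tested with 100 000 permutations.
print(min_bonferroni_p(100_000, 4840))
# Ten times more permutations lower the floor tenfold, at ten times the cost.
print(min_bonferroni_p(1_000_000, 4840))
```

With 100 000 permutations the floor sits just under the conventional 0.05 level, so every significant token receives essentially the same corrected P-value and the tokens cannot be ranked against one another.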
Outlier: an empirical outlier detection method for finding over-represented terms in PubMed abstracts

This method exploits observations made when plotting the number of genes on the entire chip that contain a token of interest (the Chip frequency) versus the number of genes in the query gene list that are associated with that token (the List frequency). On such plots (see Figure 2

To test this we developed an outlier detection method denoted here as Outlier. The detailed methodology is described in the Materials and Methods section but briefly, we use a plot such as Figure 2

ExtendedHG: identify over-represented terms in PubMed abstracts using the extended hypergeometric test

The second method, called ExtendedHG, is a parametric approach based on the extended hypergeometric distribution. The key concept underlying this approach is that annotation bias will cause common, non-specific terms to have higher probabilities of being selected than expected by chance. Therefore the sampling procedure is biased, with the token frequency distribution following the extended hypergeometric distribution. The degree of bias is measured by the odds ratio, which is equivalent to the probability ratio of seeing a token of interest over other tokens simply by chance, and a P-value for each token can therefore be calculated. ExtendedHG produced results similar to the permutation test (Table 2). All 27 tokens identified as over-represented in the ISG gene list by ExtendedHG are biologically plausible, while non-specific words that were called significant by the classical hypergeometric test-based approach, such as ‘synthesis’, ‘molecule’ and ‘after’, were not selected. As with Outlier, a typical runtime for ExtendedHG is ~20–30 s when applied to a 500-gene list.
A comparison of the rankings between the top 100 most significant terms in ISG gene list shows that, despite minor differences in rank order, there is a good concordance between Outlier and ExtendedHG (Supplementary Data 1 Figure S2). False positive rates under the null hypothesis From the specificity perspective, an ideal ORA method should not find any significant terms in a random gene list. To estimate the false positive rates associated with the proposed methods, we created 1000 random gene lists by randomly sampling 50–2000 unique genes from the HG-U133A array and analysed them with Outlier and ExtendedHG. The false positive rate of Outlier ranges from 0.18 to 1.84, with shorter gene lists (<300 genes) being more susceptible to false positives. This is because the Z-scores distribution from the outlier detection procedure tends to show a slight negative skewness for short gene lists, but is closer to being normally distributed for longer gene lists (see Supplementary Data 4). ExtendedHG shows a low false positive rate (<0.01) even for short gene lists. Comparison with existing ORA tools Using the ISG gene list as the benchmarking dataset, we observed a good agreement between the biology associated with the enriched GO terms reported by the functional annotation tool DAVID 2.0 (http://david.abcc.ncifcrf.gov/home.jsp) and the PubMed abstract terms produced by our methods, with concepts related to immune response highly ranked by both approaches (Tables 2 and 3). As an illustration of the limitations of ontologies such as GO we noted that none of the significant GO terms gives an indication of the involvement of interferon, thus demonstrating how mining of PubMed abstracts can potentially reveal additional biological insight that is not possible by mining pre-defined ontologies alone.
Performance across different species Outlier and ExtendedHG can be readily extended to other species for which an associated corpus of PubMed abstracts is available, although their power will depend on the extent and quality of annotation. To test this, we analysed 354 gene lists collected from published literature spanning 10 major Affymetrix chip types and eight model organisms including human, mouse, rat, Arabidopsis, Drosophila, C. elegans, Xenopus and Zebrafish (see Supplementary Data 2 for details of these gene lists). We found that the number of tokens identified as over-represented by the two methods varies substantially between species (Supplementary Data Figures S3 and S4). This appears to be related to the amount of annotation available to each species in the text corpus used. Those species with a higher amount of overall annotation per gene tend to produce, on average, more significant tokens per gene list tested (Figure 3
DISCUSSION

We have presented several approaches for mining literature-based information associated with a list of differentially expressed genes (DEG) and for searching within it for terms or biological concepts that are significantly over-represented. Our initial explorations using the classical hypergeometric distribution revealed a hitherto unexpected bias in the degree of PubMed annotation associated with gene lists derived from ‘real’ biological experiments. We hypothesised that gene lists generated from real-life biological experiments are likely to be biased towards older genes (i.e. known for a longer period of time) from more established areas of biology. Indeed, as shown in Figure 1

To address this annotation bias we have implemented three different approaches to ORA using PubMed tokens, based on a permutation test, an outlier detection method, and the use of the extended hypergeometric distribution. The latter two are computationally tractable and this enabled us to benchmark them against 354 literature-derived gene lists. We find that tokens plausibly relevant to each study are often called significantly enriched, whilst the apparent over-representation of common terms due to annotation bias is successfully avoided. The results produced by the proposed methods generally show a good concordance in most analyses that we have performed. These tools provided a similar but distinct insight into the themes over-represented in a gene list compared to the results from undertaking ORA using GO terms, and can be successfully applied not only to well-annotated species but also to model species such as Arabidopsis. Evaluating the performance of any exploratory approach such as those proposed herein is a challenging task because it is difficult to find datasets for which the ground truth is known. We have therefore undertaken a more focused approach to assess the performance of the proposed methods.
Specifically, we focused on gene lists based on the HG-U133A array, and compared the outcome from Outlier and ExtendedHG with those obtained from a standard ORA approach that mines GO terms. The biological relevance and plausibility of the over-represented tokens and GO terms were then assessed against the perceived biology of the original publication. This set of data is presented as Supplementary Data 3. For literature gene lists derived from other arrays, the token- and GO-based ORA results are readily accessible via our website for review by researchers with the relevant biological background. Several groups have undertaken the challenge of incorporating literature-based information into data mining algorithms to interpret the underlying biological significance of a list of DEGs (12,14,15); their approaches differ fundamentally from our methods. The closest in spirit to ours is the GEISHA (Gene Expression Information System for Human Analysis) system developed by Blaschke et al., which evaluates the significance of terms associated with a gene cluster by comparing their frequency of abstracts with the frequency of abstracts containing these terms in different gene clusters. However, the online version of this system was only implemented for E. coli and yeast. Therefore, it has not been possible to perform a direct comparison between this tool and our methods. Like other ORA approaches, our methods require an initial selection of DEGs by an arbitrarily chosen cut-off threshold. A major criticism of such a ‘threshold-based’ approach is that different choices of the cut-off value will produce different lists of DEGs and alter the result of the enrichment analysis. Moreover, many genes with moderate but meaningful expression changes may be discarded by the selected threshold regardless of their relative position in the ranked list, leading to a loss in statistical power.
In recent years, an alternative mode of analysis that does not involve an initial gene selection step has been proposed. Examples of these include Gene Set Enrichment Analysis (GSEA) (16–18) and Functional Class Scoring (FCS) (19). These methods consider the distribution of a functionally defined group of genes in the ranked list of genes and allow adjustments for their correlation structure. While a few studies have shown that such threshold-free approach enables the detection of more subtle functional categories that were overlooked by ORA (19,20), Manoli and coworkers (21) found that ORA produced more consistent results than GSEA with respect to the concordance between analyses on DEG obtained by different statistical methods from three prostate cancer data sets. Although it would be computationally challenging in scale, it may be possible to develop threshold-free methods that can accommodate annotation bias and thus be applied to the mining of PubMed tokens and we are currently exploring this question. The methods described here depend on a corpus of articles relevant to the genes being studied (e.g. all genes appearing on an array), and an index that links the articles to the appropriate genes. We used the manually curated citations provided by NCBI to retrieve the relevant gene-related PubMed abstracts. Although such curation provides for high quality, this process together with the volume of research activity in different areas means that the coverage of less heavily studied species is still limited and this has a direct effect on the power of our method. Incorporation of additional gene-citation links, perhaps from species-specific databases, would increase the amount of textual information in the corpus and improve the power of the proposed methods. Our methods are currently based on a simple processing and analysis of the text corpus. 
There are several areas where this could be made more sophisticated in the future, such as the removal of stopwords and the use of thesauri to allow for the identification of multi-word biological concepts and synonyms mapped to the same gene. These steps should reduce the noise caused by natural language variation and improve the information content of the over-represented tokens. To conclude, we have described the problems and challenges associated with existing ORA methods when adapting them for mining text-based information, and three novel approaches have been proposed to address some of these problems. Analysis performed on several independent datasets shows that the proposed methods produce biologically meaningful results that are in good agreement with the manually determined annotations (Supplementary Data 3). These examples also demonstrate that a coherent picture that exists within a complex group of genes can be discerned by incorporating textual information embedded in the literature as a knowledge source into the analysis of gene expression data. We believe that the proposed text-based ORA approaches can be used to complement and extend existing ontology-based functional analysis tools for guiding the biological interpretation of complex microarray data.

FUNDING

Cancer Research UK (grant number C8731/A5579). Funding for open access charge: Cancer Research UK (grant number C8731/A5579).

Conflict of interest statement. None declared.

Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
We thank Peter Giles for help with establishment of the web server for these methods. We are also grateful to Peter Holmans, Alex Richards, Peter Giles and Suraj Menon for reading this manuscript and making constructive suggestions.

REFERENCES
1. Hosack DA, Dennis G, Jr, Sherman BT, Lane HC, Lempicki RA. Identifying biological themes within lists of genes with EASE. Genome Biol. 2003;4:R70.
2. Al Shahrour F, Diaz-Uriarte R, Dopazo J. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics. 2004;20:578–580.
3. Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat. Genet. 2002;31:19–20.
4. Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, et al. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 2003;4:R28.
5. Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA. Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res. 2003;31:3775–3781.
6. Sanda C, Weitzel P, Tsukahara T, Schaley J, Edenberg HJ, Stephens MA, McClintick JN, Blatt LM, Li L, Brodsky L, et al. Differential gene induction by type I and type II interferons and their combination. J. Interferon Cytokine Res. 2006;26:462–472.
7. Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics. 2001;17:509–519.
8. Nishimura MT, Stein M, Hou BH, Vogel JP, Edwards H, Somerville SC. Loss of a callose synthase results in salicylic acid-dependent disease resistance. Science. 2003;301:969–972.
9. Johnson NL, Kemp AW, Kotz S. Univariate Discrete Distributions. 3rd ed. New York: Wiley; 2005.
10. Harkness WL. Properties of the extended hypergeometric distribution. Ann. Math. Stat. 1965;36:938–945.
11. Fog A. Sampling methods for Wallenius' and Fisher's noncentral hypergeometric distributions. Commun. Stat.: Simulat. Comput. 2008;37:241–257.
12. Blaschke C, Oliveros JC, Valencia A. Mining functional information associated with expression arrays. Funct. Integr. Genomics. 2001;1:256–268.
13. Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005;21:3587–3595.
14. Chaussabel D, Sher A. Mining microarray expression data by literature profiling. Genome Biol. 2002;3:RESEARCH0055.
15. Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, De Moor B. TXTGate: profiling gene groups with text-based information. Genome Biol. 2004;5:R43.
16. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 2003;34:267–273.
17. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA. 2005;102:15545–15550.
18. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ. Discovering statistically significant pathways in expression profiling studies. Proc. Natl Acad. Sci. USA. 2005;102:13544–13549.
19. Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E. Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochem. Res. 2004;29:1213–1222.
20. Ben-Shaul Y, Bergman H, Soreq H. Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics. 2005;21:1129–1137.
21. Manoli T, Gretz N, Grone HJ, Kenzelmann M, Eils R, Brors B. Group testing for pathway analysis improves comparability of different microarray datasets. Bioinformatics. 2006;22:2500–2506.