AMIA Annu Symp Proc. 2010; 2010: 797–801.
Published online Nov 13, 2010.
PMCID: PMC3041299

An Ontology-Neutral Framework for Enrichment Analysis

Rob Tirrell, BS,1 Uday Evani, MS,2 Ari E. Berman, PhD,2 Sean D. Mooney, PhD,2 Mark A. Musen, MD, PhD,1 and Nigam H. Shah, MBBS, PhD1

Abstract

Advanced statistical methods used to analyze high-throughput data (e.g., gene-expression assays) result in long lists of “significant genes.” One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene set, and is relevant for and extensible to data analysis with other high-throughput measurement modalities such as proteomics, metabolomics, and tissue-microarray assays. With the availability of tools for automatic ontology-based annotation of datasets with terms from biomedical ontologies besides the GO, we need not restrict enrichment analysis to the GO. We describe RANSUM (Rich Annotation Summarizer), which performs enrichment analysis using any ontology in the National Center for Biomedical Ontology’s (NCBO) BioPortal. We outline the methodology of enrichment analysis and the associated challenges, and discuss novel analyses enabled by RANSUM.

Introduction

In the past decade, researchers have grappled with emerging high-throughput technologies and the data analysis problems they present. Enrichment analysis provides a means of understanding the results of high-throughput datasets [1, 2]. Conceptually, enrichment analysis involves associating elements in the results of high-throughput data analysis with concepts in an ontology of interest, using the ontology hierarchy to create a summarization of the result, and computing statistical significance for any observed trend.

The canonical example of enrichment analysis is when the output dataset is a list of genes differentially expressed in some condition. Determining the biological relevance of a lengthy gene list is a difficult problem. In this case, the usual solution is to perform enrichment analysis with the GO [1]. We can aggregate the annotating GO concepts for each gene in this list, and arrive at a profile of the biological processes or mechanisms affected by the condition under study.

While GO has been the principal target for enrichment analysis, the methods of enrichment analysis are generalizable. We can conduct the same sort of profiling along other ontologies of interest. Just as scientists can ask “Which biological process is over-represented in my set of interesting genes or proteins?”, we can also ask “Which disease (or class of diseases) is over-represented in my set of interesting genes or proteins?” For example, by annotating known protein mutations with disease terms from the ontologies in BioPortal, Mort et al. recently identified a class of diseases (blood coagulation disorders) that were associated with a 14-fold depletion in substitutions at O-linked glycosylation sites [3].

However, existing tools do not permit analysis along arbitrary ontology hierarchies. After more than 10 years of using enrichment analysis, researchers still use GO almost exclusively for such analysis. There are currently over 400 publications on methods and tools for GO-based enrichment, but (to the best of our knowledge) only a single tool, Genes2Mesh, uses something other than the GO (i.e., the Medical Subject Headings, or MeSH) to calculate enrichment [4].

Enrichment analysis provides an intuitive means of synthesizing the results of high-throughput experiments. It owes some of its popularity to the fact that the process is methodologically straightforward and yields easily interpretable results. Apart from bioinformatics analyses, enrichment analysis can also be used as an exploratory tool to generate hypotheses for clinical research. Auto-generated annotations (from multiple ontologies) on patient cohorts can provide a foundation for enrichment analysis for risk-factor determination. For example, enrichment analysis can identify general classes of drugs, diseases, and test results that are commonly found in readmitted transplant patients but not in healthy recipients.

As noted, the GO has been the principal target for such analysis; the GO’s own website lists over 50 tools that can be used in this process. In 2005, Khatri et al. [1] noted that, despite widespread adoption, GO-based enrichment analysis has intrinsic drawbacks, the primary one being incompleteness of and bias among available manually created annotations (even today, roughly 20% of genes lack any GO annotation).

Building upon existing tools and content in BioPortal, we create a system that addresses some of these drawbacks and extends enrichment analysis to the numerous ontologies available in the NCBO’s BioPortal. The NCBO’s mandate is to build a comprehensive library of biomedical ontologies and to create tools and methods allowing people to use these ontologies. In this context, we build upon our prior work of creating a comprehensive ontology library [5] to create a tool for enrichment analysis using any biomedical ontology.

RANSUM uses BioPortal as its repository of ontologies; BioPortal encapsulates disparate ontologies and related annotated data in one common interface available via REST Web services, and RANSUM builds on top of these services. Therefore, RANSUM is able to work with any ontology in BioPortal, including GO and approximately 200 others. The ontology library is separately curated and updated by the administrators of BioPortal, decoupling us from the underlying representation and versioning of the ontologies.

Methods

We have addressed some of the principal shortcomings of present methods – especially by providing a mechanism to perform enrichment analysis when preexisting annotations do not exist. We have deployed our tool as a web service to enable the research community to utilize enrichment analysis in domains beyond just expression analysis. The workflow for enrichment analysis implemented by RANSUM is shown in Figure 1. There are two principal types of input scenarios for the service. In the first case, the user already has the elements of the dataset annotated with specific ontology terms; the user uploads a file associating element identifiers (gene names, patient ID numbers, etc.) with ontological term identifiers. This is the more traditional use-case.

Figure 1.
Workflow Schematic of Enrichment Analysis: If the input set has only textual annotations, we first run the Annotator service to create ontology-term annotations. The annotation counts in the input set are then aggregated along the ontology hierarchy and ...

Utilizing existing NCBO services, we also support a second, novel use case. In this situation, the user uploads a file associating the same sorts of identifiers as before with textual descriptions instead of ontology terms (Step 0). For example, a user can submit a file associating gene IDs with their GeneRIF descriptions from NCBI. We invoke the NCBO Annotator service to process these textual descriptions and assign ontology terms to the element identifiers [6, 7]. We access the Annotator via its programmatic interface. Given the user’s selection of an ontology and set of UMLS Semantic Types, the Annotator processes the input text (say, GeneRIFs) to identify concepts that match ontology terms (based on preferred names or synonyms). The implementation details and accuracy of this process are described in [8]. The result is a list of annotated element identifiers based on the input textual description, and this output is equivalent to the first input type. Using this step, we are able to create ontology-based annotations for free-text descriptions. Thus, we are not reliant on the availability of exhaustive manually curated annotations (such as those required with GO-based analyses).
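The matching idea behind Step 0 can be illustrated with a toy, dictionary-based sketch. It is not the NCBO Annotator and does not call its interface. The lexicon, the term identifiers (T1, T2) and the function name annotate are hypothetical, and matching is reduced to case-insensitive whole-phrase lookup of preferred names and synonyms.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon: term ID -> preferred name followed by synonyms.
LEXICON = {
    "T1": ["alzheimer's disease", "alzheimer disease"],
    "T2": ["diabetes mellitus", "diabetes"],
}


def annotate(descriptions, lexicon):
    """Map each element ID to the set of term IDs whose names occur in its text."""
    # One case-insensitive, whole-phrase pattern per term.
    patterns = {
        term_id: re.compile(
            r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
            re.IGNORECASE,
        )
        for term_id, names in lexicon.items()
    }
    annotations = defaultdict(set)
    for element_id, text in descriptions.items():
        for term_id, pattern in patterns.items():
            if pattern.search(text):
                annotations[element_id].add(term_id)
    # Elements with no matching term are simply absent from the result.
    return dict(annotations)
```

The output has the same shape as the first input type (element identifiers associated with ontology term identifiers), so the later steps do not need to know which route produced it.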

Step 1 After this optional preprocessing step, for each ontology term in the input dataset we traverse the ontology structure and retrieve the complete listing of paths from the concept to the root(s) of the ontology. We walk through each of these paths, essentially recapitulating the ontology graph. Each term along the path is recorded as an annotation of the element identifier in the input dataset with which the starting term was associated. We refer to this procedure of tracing terms back to the graph’s root as performing the transitive closure over the ontology. In essence, starting from the child-parent (IS_A) relationships, we generate the complete set of implied (indirect) annotations based on child-ancestor relationships, by traversing and aggregating along the ontology hierarchy.
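A minimal sketch of this closure step follows, assuming the IS_A hierarchy is available as a plain mapping from each term to its direct parents. The function names and the toy term names are illustrative only. Instead of enumerating every path to the root, the sketch collects the set of ancestors by graph traversal, which yields the same implied annotations.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Return every term reachable from `term` via IS_A edges (excluding itself)."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(direct_annotations, parents):
    """Add all implied ancestor terms to each element's annotations,
    then count how many distinct elements map to each term."""
    closed = {}
    for element, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        closed[element] = expanded
    counts = defaultdict(int)
    for expanded in closed.values():
        for term in expanded:
            counts[term] += 1
    return closed, dict(counts)
```

Because each element's terms are held in a set, an ancestor reached through several paths is counted once per element, not once per path.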

Step 2 Once we have a collection of ontology terms and their aggregate frequencies in the input dataset, we next approach the problem of determining the meaning or significance of the results. Enrichment analysis with GO has profited from the existence of a natural and easily defensible choice for a background set—all of the given organism’s genes, all genes measured on the platform, etc. For most of the ontologies we consider, no such distribution exists.

For calculating statistical enrichment, we need the background term frequency to determine whether the aggregate annotation counts after Step 1 are “surprising” given the background. Often, a clear background set does not exist. This is perhaps the single greatest difficulty enrichment analysis faces. Leveraging existing projects and resources, we can begin to address this problem in several ways; currently, we use two heuristic approaches.

First, we have access to a database of automatically created annotations over the entirety of MEDLINE abstracts. This indexing was performed for nearly all of the ontologies in BioPortal, and thus we will be able to use this data source as an approximate proxy for the true “background distribution” frequency of terms for most input datasets. To generate the background frequency for a given ontological term X, we retrieve its preferred name and all of its synonyms, and then add up the MEDLINE occurrence counts for each of these strings. We return this number (n) as well as the total number of entries in the MEDLINE annotation database (N). The fraction n/N then represents the background frequency of the term X in the annotated corpus. Using this frequency, we can test for significant over- or under-representation in the input dataset. We also provide the option to compute this fraction n/N via a Google search made using the term X.
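As a rough illustration of how n/N can feed a significance calculation, the sketch below sums per-name occurrence counts into a background frequency and then computes a one-sided binomial tail probability. The text does not say which statistical test is applied, so the binomial test is an assumption made here for illustration. The function names and the counts in the usage note are hypothetical.

```python
from math import comb


def background_frequency(name_counts, total_entries):
    """n/N: summed occurrence counts of a term's names over the corpus size."""
    return sum(name_counts.values()) / total_entries


def binomial_upper_tail(k, m, p):
    """P(X >= k) for X ~ Binomial(m, p): the chance of seeing k or more
    annotated elements among m if the background rate p applied."""
    return sum(comb(m, i) * p**i * (1 - p) ** (m - i) for i in range(k, m + 1))
```

For instance, with hypothetical counts of 600, 300 and 100 for three names of a term in a corpus of 100,000 entries, the background frequency is 0.01; observing the term on 5 of 50 input elements would then give a small upper-tail probability, suggesting over-representation. The lower tail can be computed analogously to test for depletion.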

The second approach uses NCBO’s Resource Index, a large repository of automatically created annotations of records in over 20 public biomedical databases. Access to the Resource Index enables us to make the same sort of calculations as with the MEDLINE term frequencies, but offers additional information. In particular, we can analyze co-occurrence of ontological terms in textual descriptions and annotations of datasets, enabling us to quantify the degree to which terms are independent or correlated in the annotation space.
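One common way to quantify whether two terms co-occur more or less often than independence would predict is pointwise mutual information (PMI). The text does not name a specific measure, so PMI is an assumption chosen here for illustration. Each record is represented as a set of term identifiers, and the function name pmi is hypothetical.

```python
from math import log2


def pmi(records, term_a, term_b):
    """Pointwise mutual information of two terms over annotated records.

    Returns 0 when the terms co-occur exactly as often as independence
    predicts, a positive value when they co-occur more often, and None
    when they never co-occur (PMI is undefined in that case).
    """
    total = len(records)
    n_a = sum(1 for record in records if term_a in record)
    n_b = sum(1 for record in records if term_b in record)
    n_ab = sum(1 for record in records if term_a in record and term_b in record)
    if total == 0 or n_ab == 0:
        return None
    return log2((n_ab / total) / ((n_a / total) * (n_b / total)))
```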

Step 3 RANSUM provides two output mechanisms. The first is a tag cloud, which intuitively summarizes the results of the analysis. The sizes and colors of terms in the cloud indicate the relative frequency of the terms, offering a high-level overview. However, a tag cloud’s representative ability is limited because there is no easy way to show significance relative to some expectation, or to show the elements in the input associated with some term (except via tooltips).

The second output format is in XML, which is amenable to postprocessing by the user, as needed. Each term node contains its respective frequency information in the input data along with the counts on which the frequency is based. The nodes additionally contain the list of identifiers that mapped to that term. Each node includes information on the level in the ontology at which the term is found. If the MEDLINE background frequency was requested, this data is also present for each node.
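Postprocessing such output might look like the sketch below. The XML layout shown is invented for illustration, because the actual RANSUM schema is not reproduced in the text. All element and attribute names (terms, term, count, total, elements, element, id, name, level) and the function name read_terms are assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the real output schema may differ.
SAMPLE = """
<terms>
  <term id="T1" name="aging" level="3">
    <count>12</count>
    <total>40</total>
    <elements><element>g1</element><element>g2</element></elements>
  </term>
  <term id="T2" name="dna repair" level="5">
    <count>4</count>
    <total>40</total>
    <elements><element>g2</element></elements>
  </term>
</terms>
"""


def read_terms(xml_text):
    """Flatten each term node into a dict with its frequency, level and elements."""
    rows = []
    for node in ET.fromstring(xml_text.strip()).iter("term"):
        count = int(node.findtext("count"))
        total = int(node.findtext("total"))
        rows.append({
            "id": node.get("id"),
            "name": node.get("name"),
            "level": int(node.get("level")),
            "frequency": count / total,
            "elements": [element.text for element in node.iter("element")],
        })
    return rows
```

Rows in this form can be sorted by frequency or filtered by ontology level before further analysis.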

Results

A prototype implementation

The prototype implementation demonstrates the technical feasibility of performing enrichment analysis over ontology hierarchies besides GO. The ability to do so builds upon the previous efforts of the NCBO [6–8]. Access to the service is via a standard web interface or via a Representational State Transfer (REST) endpoint available for programmatic access.

We anticipate that most interactions with the service will be programmatic. In order to facilitate this mode of access, we have created a set of command-line user-agents to handle the task of uploading datasets (once-off, or sequentially) and storing the analysis results. These command-line tools allow users to perform repetitive queries to the Web service, and can easily be integrated into existing workflows. These tools and documentation are available from: www.bioontology.org/wiki/index.php/Annotation_Summarizer

Novel use cases enabled

Analysis of protein annotations

Gene Ontology enrichment tools have traditionally been used to analyze lists of genes obtained from high-throughput experiments [1]. To demonstrate that RANSUM can perform enrichment analysis and recover known annotations, as well as to demonstrate enrichment analysis with multiple ontologies, we analyzed a list of 852 known aging-related genes from the GenAge database [9].

We started by collecting textual descriptions for UniProt protein entries corresponding to each human gene in the GenAge database. The textual descriptions included the protein name, gene name, and general descriptions of the function and catalytic activity, as well as keywords and GO terms. We processed this text as described in the RANSUM workflow and created summarized annotations from Medical Subject Headings (MSH), Online Mendelian Inheritance in Man (OMIM), UMLS Metathesaurus (MTH), and Gene Ontology (GO).

We created a background set of 19671 proteins by applying the same protocol to the entire human proteome, restricting to manually annotated and reviewed proteins from SwissProt (Jan 2010 version). We calculated enrichment and depletion of specific terms, corrected for multiple hypotheses outside of the RANSUM workflow, and obtained a list of significant terms for all four ontologies.
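With an explicit background set like this one, a standard choice is a hypergeometric test per term followed by a multiple-testing correction. The text does not state which test or which correction was used, so the hypergeometric tail and the Benjamini-Hochberg adjustment below are assumptions, and the function names are illustrative.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n items are drawn without replacement from N,
    of which K carry the term (one-sided enrichment p-value)."""
    numerator = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    )
    return numerator / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For the case study above, a term annotated to k of the 852 input proteins and K of the 19671 background proteins would be scored with hypergeom_upper_tail(k, 852, K, 19671), and the resulting p-values for each ontology would then be passed through benjamini_hochberg.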

Not surprisingly, ‘aging’ is an enriched term. There were several other terms enriched such as ‘electron transport’ (2.79e-10), ‘protein kinase activity’ (2.8e-10) and ‘nucleotide excision repair’ (8.78e-07) which appeared in MSH, MTH, and GO. The enriched terms also included aging related diseases such as ‘Alzheimer’s disease’ (0.01), ‘Werner syndrome’ (5.3e-05), ‘Diabetes Mellitus’ (1.5e-04) and ‘neurodegeneration’ (2.5e-03) from OMIM.

This case study demonstrates that enrichment analysis with multiple ontologies is feasible and that it enables a comprehensive understanding of the biological “signal” present in gene/protein lists.

Analysis of funding trends

To demonstrate the utility of this tool in a novel domain, we processed the funding allocations of the NIH in fiscal years 1980–1989. We aimed to identify trends in institutional funding priorities over time, as represented by changes in the relative frequencies of ontology concepts from year-to-year. Using a database containing the complete set of grants in this interval—with their titles, amounts, recipient institutions, etc.—we selected grants worth over $1,250,000 (in constant 2008 dollars). We annotated the titles of these grants with SNOMEDCT terms and used the annotation sets to generate tag clouds for each year, such as the one shown in Figure 3 for year 1981, to create a visual summary of funding trends.

Figure 3:
Tag Cloud Output: An example for the annotations of grants from FY1981 using SNOMEDCT. Blue denotes low-frequency terms and red denotes highly frequent terms. Many concepts, such as “neoplasm of digestive tract”, occur at high frequencies ...

Hypothesis generation for Clinical Research

Enrichment analysis can be used as an exploratory tool to generate hypotheses for clinical research. For example, in the case of kidney transplants, extended-criteria donor (ECD) organs have a 40% rate of delayed graft function and a higher incidence of rejection compared to standard-criteria donor (SCD) kidneys. Identifying the causes of this difference is crucial for identifying patients in whom an ECD transplant has a high chance of working.

The datasets collected to address this question comprise immunological metrics beyond the standard clinical risk factors, including multi-parameter flow-cytometric analysis of the peripheral immune-cell repertoire, genomic analysis, and donor-specific functional assessments. These patient data sets can be annotated using automated methods [7, 8] to enable enrichment analysis for risk-factor determination.

For example, simple enrichment analysis might allow identification of classes of drugs, diseases, and test results that are commonly found only in readmitted transplant patients. Enrichment analysis to identify common pairs of terms of different semantic types can identify combinations of drug classes and co-morbidities, or test risk-factors and co-morbidities that are common in this population.

Discussion and Future Work

To validate our methods, we can compare our results to those generated by other enrichment analysis tools run on synthetic datasets with known signal and enrichment patterns. This validation procedure, recently described by Toronen et al. [10], would provide direct evidence of the soundness of the method.

The inconsistency of abstraction levels in ontologies is an often-discussed stumbling block for enrichment analysis [1]. Two terms at equal depths may not represent concepts of similar granularity, creating a bias in the reported term enrichment. By analyzing the frequencies of terms in MEDLINE and the NCBO Resource Index, we can perform a thorough analysis of dependencies among ontology-term annotations to make existing biases explicit, as well as define custom abstraction levels using methods developed by Alterovitz et al. [11].

Probably the greatest challenge is the determination of an appropriate source of background term frequencies. We have access to several resources that can partially address this problem; we will compare the appropriateness of the different sources of background frequencies by using them with synthetic datasets with known signal levels.

Conclusions

Because enrichment analysis with GO is widely accepted and scientifically valuable, we believe that the logical next step is to extend this methodology to other ontologies. We’ve implemented this extension.

We’ve also addressed some of the limitations of existing analysis methods [1]. For example, the roughly 20% of genes that lack annotations can now be associated, via their GeneRIFs, with terms from multiple ontologies. We have clear courses of action for overcoming other limitations such as inconsistent abstraction levels.

We have enabled enrichment analysis using arbitrary textual descriptions, without requiring manually created annotations, thus introducing enrichment analysis as a research methodology in new domains such as portfolio analysis and hypothesis generation for comparative effectiveness research.

Figure 2:
XML Output: showing summarized annotations with terms in the Mammalian Phenotype ontology. The term-frequency in the input-dataset is included, as well as the background term frequency from the MEDLINE annotations and Google indices

Acknowledgments

This work is supported by NIH grant U54 HG004028 for the National Center for Biomedical Ontology. We thank Clement Jonquet and Adrien Coulet for helpful discussions and feedback.

References

1. Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005;21(18):3587–95.
2. Shah NH, Fedoroff NV. CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology. Bioinformatics. 2004;20(7):1196–7.
3. Mort M, et al. In silico functional profiling of human disease-associated and polymorphic amino acid substitutions. Hum Mutat. 2010;31(3):335–46.
4. Ade AS, et al. Genes2Mesh. 2007. [cited March 2010]; Available from: http://gene2mesh.ncibi.org.
5. Noy NF, et al. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research. 2009;37:W170–W173.
6. Shah NH, et al. Comparison of concept recognizers for building the Open Biomedical Annotator. BMC Bioinformatics. 2009;10(Suppl 9):S14.
7. Jonquet C, et al. The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics. San Francisco, CA: 2009.
8. Shah N, et al. Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics. 2009;10(Suppl 2):S1.
9. de Magalhaes JP, et al. The Human Ageing Genomic Resources: online databases and tools for biogerontologists. Aging Cell. 2009;8(1):65–72.
10. Toronen P, et al. Generation of Gene Ontology benchmark datasets with various types of positive signal. BMC Bioinformatics. 2009;10:319.
11. Alterovitz G, et al. Ontology engineering. Nat Biotechnol. 2010;28(2):128–30.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association