Copyright ©2008 AMIA - All rights reserved.

Annotating breast cancer microarray samples using ontologies

1 Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC
2 Department of Information Systems, University of Maryland Baltimore County, Baltimore, MD

This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose.

Abstract

As the most common cancer among women, breast cancer results from the accumulation of mutations in essential genes. Recent advances in high-throughput gene expression microarray technology have inspired researchers to use the technology to assist breast cancer diagnosis, prognosis, and treatment prediction. However, the high dimensionality of microarray experiments and public access to data from many experiments have caused inconsistencies, which initiated the development of controlled terminologies and ontologies for annotating microarray experiments, such as the standard Microarray Gene Expression Data (MGED) ontology (MO). In this paper, we developed BCM-CO, an ontology tailored from the NCI Thesaurus specifically for indexing clinical annotations of breast cancer microarray samples. Our research showed that the coverage of NCI Thesaurus is very limited with respect to i) terms used by researchers to describe breast cancer histology (covering 22 out of 48 histology terms); ii) breast cancer cell lines (covering one out of 12 cell lines); and iii) classes corresponding to breast cancer grading and staging. By incorporating a wider range of those terms into BCM-CO, we were able to index breast cancer microarray samples from GEO using BCM-CO and the MGED ontology, and we developed a prototype system with a web interface that allows the retrieval of microarray data based on the ontology annotations.

Introduction

Breast cancer is the most common cancer among women.
Similar to other cancer types, breast cancer results from genetic and environmental factors leading to the accumulation of mutations in essential genes. Recent advances in high-throughput gene expression microarray technology have enabled breast cancer researchers to i) define expression patterns of breast cancer to identify specific phenotypes (diagnosis), ii) establish a patient’s expected clinical outcome independent of treatment (prognosis), and iii) predict the potential outcome of a specific therapy (prediction) [1–4]. Many breast cancer studies involving expression microarrays have been published in the last ten years, but differences between studies often result in inconsistent outcomes, causing considerable skepticism [4–8]. We believe that the inconsistency problem is attributable to the high dimensionality of microarray experiments, which include a large number of sample clinical characteristics [2]. In clinical practice, measurable characteristics (e.g., tumor size, spread to lymph nodes, distant metastases, and histological appearance), patient characteristics (e.g., age, smoking, and menopausal status), and immunohistochemistry characteristics (e.g., ER, PR, and ERBB2) are usually considered together as an approximate guide to tumor behavior and as the basis for prognostication and evaluation of the response to therapy. It is therefore important not to ignore those clinical characteristics, since expression differences between two patient groups could be caused by other factors (e.g., age) rather than the target factor (e.g., different treatments). On the other hand, microarray studies are usually expensive, and sometimes we do not have sufficient samples to derive statistically sound conclusions. It is therefore desirable to be able to retrieve relevant microarray data from public repositories for integrative and exploratory analysis to generate appropriate hypotheses.
The standard strategy for enabling comparative studies and integrative mining is to annotate microarray experiments using common standards such as controlled vocabularies or ontologies. Some resources have already been developed for capturing clinical characteristics of breast cancer and annotating microarray experiments. In this paper, we report our experience in exploring the Microarray Gene Expression Data (MGED) ontology (MO) [9] and NCI Thesaurus for annotating breast cancer microarray data available at the Gene Expression Omnibus (GEO) [10]. Specifically, we tailored NCI Thesaurus to obtain the breast cancer microarray clinical ontology (BCM-CO), an ontology to capture breast cancer microarray clinical information. The coverage of NCI Thesaurus was evaluated using samples available at GEO, and the terms not covered were incorporated into BCM-CO. A prototype system was developed to allow retrieval of breast cancer microarray data using those ontology terms. In the following discussion, background information is provided for MO, NCI Thesaurus, and GEO. The study design is described next. We then present the results and the prototype system.

Background

Before introducing the resources used in the study, we briefly introduce the Web Ontology Language (OWL), a common standard for semantic representation in defining and instantiating ontologies, built on top of the Resource Description Framework (RDF; http://www.w3.org/RDF). To formally specify the meaning of annotations, OWL provides a vocabulary of terms that describe classes, individuals, and properties: classes represent concepts from the knowledge domain (e.g., “cancer grade”); individuals are specific instances of classes (e.g., “high grade”); properties represent relationships between individuals. In this paper, classes and individuals are also referred to as nodes.

The MGED ontology was developed by the MGED society.
It provides terms for annotating all aspects of a microarray experiment, from the design and array layout to the preparation of the biological sample and the protocols used to hybridize the RNA and to analyze the data. Researchers can use this information to explore third-party data, validate comparisons between data, and assist interpretation of data. The current version (v.1.3.1.1) of the MGED ontology includes 233 classes, 681 individuals, and 143 properties. The MGED ontology is intended to provide high-level concepts for describing microarray experiments, and it refers to other resources for low-level concepts and individuals. Sample nodes are shown in Figure 1.
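The distinction among classes, individuals, and properties introduced above can be made concrete with a small sketch. This is a plain Python model, not OWL or RDF syntax; the type names, the `has_grade` property, and the `sample_001` individual are hypothetical and are used only to illustrate the “cancer grade” and “high grade” example from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyClass:
    """A concept from the knowledge domain, e.g. "cancer grade"."""
    name: str
    parent: Optional["OntologyClass"] = None


@dataclass(frozen=True)
class Individual:
    """A specific instance of a class, e.g. "high grade"."""
    name: str
    instance_of: OntologyClass


@dataclass(frozen=True)
class PropertyAssertion:
    """A named relationship between two individuals."""
    subject: Individual
    prop: str
    obj: Individual


def describe(a: PropertyAssertion) -> str:
    """Render an assertion as 'subject --property--> object (class of object)'."""
    return f"{a.subject.name} --{a.prop}--> {a.obj.name} ({a.obj.instance_of.name})"


# Class and individual taken from the example in the text.
cancer_grade = OntologyClass("cancer grade")
high_grade = Individual("high grade", cancer_grade)

# Hypothetical sample individual and property, for illustration only.
sample = Individual("sample_001", OntologyClass("sample"))
assertion = PropertyAssertion(sample, "has_grade", high_grade)

print(describe(assertion))
```

In this model both `OntologyClass` and `Individual` objects would count as nodes in the sense used in this paper.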
The NCI Thesaurus is an ontology-like vocabulary designed to meet the needs of the cancer research community. It provides unambiguous codes and definitions for concepts used in cancer research, including cancer-related diseases, findings and abnormalities, anatomy, agents, drugs and chemicals, and genes and gene products [11]. The current version (08_01d) is composed of about 61,000 classes structured into 20 taxonomic trees (i.e., kinds), and 123 properties. The NCI Metathesaurus is a broader system that integrates NCI Thesaurus with the Unified Medical Language System (UMLS) Metathesaurus. Within NCI, the NCI Thesaurus and NCI Metathesaurus are used to provide terminology support to the public Web portal, http://cancer.gov, to numerous portals supporting consortia, and to other communities of researchers.

The Gene Expression Omnibus (GEO) was initiated to serve as a public repository for a wide range of high-throughput experimental data. It includes data from single and dual channel microarray-based experiments measuring mRNA, miRNA, genomic DNA (arrayCGH, ChIP-chip, and SNP), and protein abundance, as well as data obtained using non-array techniques such as serial analysis of gene expression (SAGE), mass spectrometry peptide profiling, and various types of quantitative sequence data. Currently, GEO contains over 200,000 samples from over 8,000 experiments spanning over 4,000 different platforms.

Study design and method

There are several steps in the study design, as shown in Figure 1.

Obtaining the BCM-CO prototype

As discussed above, the MGED ontology is intended to provide high-level classes, and it refers to NCI Thesaurus and several other resources for lower-level classes and individuals. The table inside Figure 1
For each node, we recursively retrieved the fillers and their ancestors to form the BCM-CO prototype. Because NCI Metathesaurus has broader synonym coverage than NCI Thesaurus, we included all synonyms available in NCI Metathesaurus in the prototype, using the mapping between NCI Thesaurus codes and NCI Metathesaurus concept unique identifiers available in the NCI Metathesaurus release.

Modifying BCM-CO

We modified the prototype to develop BCM-CO by evaluating the prototype’s coverage of breast cancer microarray samples from GEO and incorporating missing terms and synonyms into the ontology. This practice is well known in the development of ontologies such as the Gene Ontology [12]. The process of annotating data using the ontology leads to improvements in the quality and breadth of the ontology itself, which in turn lead to improvements in subsequent generations of annotations. We retrieved all single channel microarray samples from GEO indexed with the term “Breast cancer” and the organism “Homo sapiens”. Figure 2
We identified four unstructured sections containing clinical characteristics: Title, Source Name, Characteristics, and Description. We looked up all terms of the BCM-CO prototype in the text of these sections using the following normalized string matching method based on the SPECIALIST Lexicon, one of the three components of the Unified Medical Language System (UMLS):
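The steps of the matching method are not preserved in this text. As a minimal sketch only, the example below assumes a simple normalization (lowercasing, turning punctuation into spaces, collapsing whitespace) and looks up normalized synonyms as contiguous token runs in the free text. It does not use the SPECIALIST Lexicon tools, and the synonym table and sample text are hypothetical.

```python
import re


def normalize(text):
    """Lowercase, replace non-alphanumeric runs with spaces, and tokenize."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def build_index(synonyms):
    """Map each normalized synonym (a tuple of tokens) to its ontology node."""
    index = {}
    for node, names in synonyms.items():
        for name in names:
            tokens = tuple(normalize(name))
            if tokens:
                index[tokens] = node
    return index


def match_terms(text, index):
    """Return the nodes whose normalized synonyms occur as token runs in text."""
    tokens = normalize(text)
    max_len = max((len(key) for key in index), default=0)
    found = set()
    for start in range(len(tokens)):
        for size in range(1, max_len + 1):
            key = tuple(tokens[start:start + size])
            if key in index:
                found.add(index[key])
    return found


# Hypothetical synonym table: node name -> surface forms.
synonyms = {
    "MCF7": ["MCF7", "MCF-7"],
    "Medullary_Carcinoma": ["Medullary carcinoma"],
    "Cell_Line": ["cell line"],
}
index = build_index(synonyms)

# Hypothetical free text in the style of a GEO "Source Name" section.
hits = match_terms("Source Name: MCF-7 breast cancer cell line", index)
print(sorted(hits))
```

Matching on token tuples rather than raw substrings avoids partial-word hits, at the cost of missing inflectional variants that a lexicon-based normalizer would catch.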
After mapping, the coverage was evaluated by inspecting experiment series with structured fields or those mapped only to top-level classes (e.g., Cell_Line) rather than individuals (e.g., MCF7).

Results and discussions

The BCM-CO prototype contained 1,267 classes and could be successfully loaded into ontology editors such as Protégé. We retrieved 5,407 single-channel breast cancer microarray samples from 72 GEO experiment series. There were 13,350 normalized synonyms (i.e., ignoring case and syntactic variation). The prototype coverage with respect to four categories is discussed below in detail.

DiseaseState and Histology

In the BCM-CO prototype, 82 histology nodes were mapped. We identified six series (GSE2109, GSE5949, GSE5720, GSE6595, GSE1477, and GSE7849) in which histology terms can be extracted using simple patterns. Overall, 52 terms were retrieved, of which 48 are histology terms; 22 (46%) of these 48 could not be mapped to histology nodes in the prototype. Those un-mapped terms can be grouped into the following categories (shown in Table 1):
Compositional terms

Compositional terms (e.g., “Invasive medullary carcinoma”) can be represented by a combination of modifiers (e.g., “Invasive”) and histology nodes (e.g., “Medullary carcinoma”). To capture them, modifier terms need to be included in BCM-CO. More modifier terms also need to be added. For example, “Infiltrating” is not currently listed as a modifier term in NCI Thesaurus.

Conjunction terms

Some of the histology terms are conjunctions. We see no need to capture those terms in the ontology because they can be annotated by conjunctions of nodes.

Abbreviations

Expansions of the abbreviations (e.g., Invasive Duct Carcinoma for IDC) can be found in the prototype. The abbreviations need to be added as synonyms of the corresponding nodes.

Novel terms

Some terms do not have corresponding NCI nodes (e.g., “Cribiform Carcinoma”). New nodes need to be added into the prototype to index them accurately.

StrainOrLine

Overall, only three nodes were mapped: MCF7, Tumor_Cell_Line, and Cell_Line. We inspected all free text mapped to “Cell_Line” and identified 11 breast cancer cell lines (e.g., “BT474”, “BT549”, “Hs578T”, “T47D”, “ZR751”) that are not included in NCI Thesaurus. According to the breast cancer cell line collection at Berkeley Lab (http://icbp.lbl.gov/breastcancer/celllines.php), there are 71 breast cancer cell lines that need to be added into BCM-CO as individuals.

DiseaseStaging and TumorGrading

Terms for disease staging and tumor grading are usually straightforward (e.g., “I”, “II”, “III”, “2A”, “2B”). However, we found that there are different types of grading systems for breast cancer (e.g., “Elston (NGS) histology grade” or “B-R grade”) and that NCI Thesaurus fails to distinguish them. As a result, more low-level classes and individuals need to be added in BCM-CO.
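The four categories of un-mapped histology terms described above (compositional, conjunction, abbreviation, and novel) lend themselves to a simple rule-based check. The sketch below is illustrative only and is not the procedure used in the study: the node set, modifier set, abbreviation table, and the conjunction example are hypothetical stand-ins, and the rules assume that a compositional term is a single leading modifier followed by a known histology node.

```python
import re

# Hypothetical stand-ins for histology nodes, modifiers, and abbreviations.
HISTOLOGY_NODES = {
    "medullary carcinoma",
    "ductal carcinoma",
    "lobular carcinoma",
    "invasive duct carcinoma",
}
MODIFIERS = {"invasive", "infiltrating"}
ABBREVIATIONS = {"idc": "invasive duct carcinoma"}


def categorize(term):
    """Assign a histology term to 'mapped' or one of the four un-mapped categories."""
    t = " ".join(term.lower().split())
    if t in HISTOLOGY_NODES:
        return "mapped"
    # Abbreviation: its expansion is already a known node.
    if ABBREVIATIONS.get(t) in HISTOLOGY_NODES:
        return "abbreviation"
    # Conjunction: every part joined by "and" or "/" is a known node.
    parts = [p.strip() for p in re.split(r"\s+and\s+|/", t)]
    if len(parts) > 1 and all(p in HISTOLOGY_NODES for p in parts):
        return "conjunction"
    # Compositional: a leading modifier followed by a known node.
    words = t.split()
    if len(words) > 1 and words[0] in MODIFIERS and " ".join(words[1:]) in HISTOLOGY_NODES:
        return "compositional"
    return "novel"


for example in [
    "Invasive medullary carcinoma",
    "IDC",
    "Ductal carcinoma and lobular carcinoma",  # hypothetical conjunction
    "Cribiform Carcinoma",
]:
    print(example, "->", categorize(example))
```

Under these assumptions, only the "novel" category calls for new nodes; the other three are handled by adding modifiers or synonyms, or by annotating with several nodes.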
BCArray – An ontology-enabled breast cancer microarray database

Our analysis showed that NCI Thesaurus (even including synonyms from NCI Metathesaurus) is far from comprehensive. Based on the coverage evaluation, we incorporated the missing terms discovered in the evaluation process into the prototype to obtain BCM-CO. We then used BCM-CO and the MGED ontology to annotate breast cancer single channel microarray samples. Specifically, we developed a simple parser to extract sample information under several MGED BioMaterialCharacteristics classes (e.g., “Age”, “Sex”, “BioMaterialPurity”, “Biometrics”, “ClinicalTreatment”) and the four classes evaluated here (“DiseaseState & Histology”, “StrainOrLine”, “TumorGrading”, and “DiseaseStaging”). We also extracted immunohistochemistry characteristics (e.g., ER, PR, and ERBB2) from the samples when possible. A web interface was also developed to allow researchers to retrieve data based on the ontology annotations (Figure 3).
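Retrieval based on ontology annotations can be sketched as an index from nodes to samples. The example below is a minimal illustration, not the BCArray implementation: it assumes that a query for a class should also return samples annotated with its descendants, and the hierarchy fragment, sample identifiers, and annotations are hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy fragment: child node -> parent node.
PARENT = {
    "Tumor_Cell_Line": "Cell_Line",
    "MCF7": "Tumor_Cell_Line",
    "T47D": "Tumor_Cell_Line",
}

# Hypothetical annotations: sample identifier -> ontology nodes.
ANNOTATIONS = {
    "sample_A": {"MCF7"},
    "sample_B": {"T47D"},
    "sample_C": {"Medullary_Carcinoma"},
}


def ancestors(node):
    """Yield the node itself and then each ancestor, following child -> parent links."""
    while node is not None:
        yield node
        node = PARENT.get(node)


def build_index(annotations):
    """Index each sample under its annotated nodes and all of their ancestors."""
    index = defaultdict(set)
    for sample, nodes in annotations.items():
        for node in nodes:
            for anc in ancestors(node):
                index[anc].add(sample)
    return index


def retrieve(index, *nodes):
    """Return samples annotated with every queried node (or a descendant of it)."""
    sets = [index.get(node, set()) for node in nodes]
    return set.intersection(*sets) if sets else set()


index = build_index(ANNOTATIONS)
print(sorted(retrieve(index, "Cell_Line")))
print(sorted(retrieve(index, "MCF7")))
```

Indexing samples under ancestors at build time keeps each query a simple set lookup, so a search on a top-level class such as Cell_Line also finds samples annotated only with a specific cell line.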
Conclusion

We have demonstrated a way to tailor existing ontologies for specific applications. Consistent with several other studies aiming to use NCI Thesaurus for specific applications [13,14], the study shows that NCI Thesaurus needs to be tailored and enriched, and that a variety of ontology engineering work is needed for specific applications.

Acknowledgments

This project is partially supported by the following grants: NSF/IIS-0639062, NIH/R01-CA096483, and DOD/BCRP BC030280.

References

1. Sims AH, Ong KR, Clarke RB, Howell A. High-throughput genomic technology in research and clinical management of breast cancer. Exploiting the potential of gene expression profiling: is it ready for the clinic? Breast Cancer Res. 2006;8(5):214.
2. Clarke R, Ressom HW, Wang A, et al. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer. 2008 Jan;8(1):37–49.
3. Butte AJ, Kohane IS. Creation and implications of a phenome-genome network. Nat Biotechnol. 2006 Jan;24(1):55–62.
4. Perez-Iratxeta C, Bork P, Andrade MA. Options available - from start to finish - for obtaining data from DNA microarrays II. Nat Genet. 2002;31:316–319.
5. Shi L, Tong W, Fang H, et al. Cross-platform comparability of microarray technology: intra-platform consistency and appropriate data analysis procedures are essential. BMC Bioinformatics. 2005 Jul 15;6(Suppl 2):S12.
6. Miller LD, Liu ET. Expression genomics in breast cancer research: microarrays at the crossroads of biology and medicine. Breast Cancer Res. 2007;9(2):206.
7. Draghici S, Khatri P, Eklund AC, Szallasi Z. Reliability and reproducibility issues in DNA microarray measurements. Trends Genet. 2005
8. Yauk CL, Berndt ML, Williams A, Douglas GR. Comprehensive comparison of six microarray technologies. Nucleic Acids Res. 2004;32(15):e124.
9. Whetzel PL, Parkinson H, Causton HC, et al. The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics. 2006;22(7):866–873.
10. Barrett T, Suzek TO, Troup DB, et al. NCBI GEO: mining millions of expression profiles–database and tools. Nucleic Acids Res. 2005 Jan 1;33:D562–566.
11. Sioutos N, de Coronado S, Haber MW, Hartel FW, Shaiu WL, Wright LW. NCI Thesaurus: A semantic model integrating cancer-related clinical and molecular information. J Biomed Inform. 2007;40(1):30–43.
12. Blake JA, Bult CJ. Beyond the data deluge: data integration and bio-ontologies. J Biomed Inform. 2006 Jun;39(3):314–320.
13. Shah NH, Rubin DL, Supekar KS, Musen MA. Ontology-based Annotation and Query of Tissue Microarray Data. AMIA Annu Symp Proc. 2006:709–13.
14. Marquet G, Dameron O, Saikali S, Mosser J, Burgun A. Grading glioma tumors using OWL-DL and NCI Thesaurus. AMIA Annu Symp Proc. 2007:508–12.