Copyright © 2009 The Author(s)

PLAN2L: a web tool for integrated text mining and literature-based bioentity relation extraction

1Structural Biology and Biocomputing programme, Spanish National Cancer Center (CNIO), Melchor Fernandez Almagro 3, Madrid, 28029, Spain and 2Barcelona Media - Centre d'Innovacio, Av. Diagonal 177, 08018 Barcelona, Spain

*To whom correspondence should be addressed. Tel: +34 91 224 6900; Fax: +34 91 224 6980; Email: mkrallinger/at/cnio.es

Received March 16, 2009; Revised May 13, 2009; Accepted May 19, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

There is increasing interest in using literature mining techniques to complement information extracted from annotation databases or generated by bioinformatics applications. Here we present PLAN2L, a web-based online search system that integrates text mining and information extraction techniques to systematically access information useful for analyzing genetic, cellular and molecular aspects of the plant model organism Arabidopsis thaliana. Our system facilitates more efficient retrieval of information relevant to heterogeneous biological topics, from implications in biological relationships at the level of protein interactions and gene regulation, to sub-cellular locations of gene products and associations to cellular and developmental processes, i.e. cell cycle, flowering, root, leaf and seed development. Beyond single entities, predefined pairs of entities can also be submitted as queries, for which literature-derived relations are returned together with their textual evidence.
PLAN2L does not require registration and is freely accessible at http://zope.bioinfo.cnio.es/plan2l.

INTRODUCTION

Gene regulatory mechanisms and protein interactions are studied in detail to understand how complex developmental processes are controlled. Biological annotation databases provide functional descriptions of gene products, the basic components of such biological processes, through manual literature inspection. This generally results in associations of biological entities to a set of controlled vocabulary terms contained in structured database records (1). Despite the obvious strength of controlled vocabularies for annotation consistency, information exchange and data analysis, functional annotations of proteins do not provide a straightforward way to trace back the biological evidence supporting each annotation. This sometimes makes it cumbersome for human domain experts to directly interpret under which biological context and experimental conditions a given annotation applies. Considering the growing number of published articles, manually annotating newly described functional characterizations of gene products, as well as maintaining and updating already annotated entities, is a challenging task.

This motivated recent attempts to enable more systematic access to relevant information hidden in large literature repositories using text mining and information extraction (IE) technologies, with the aim not only of supporting the literature curation process, but especially of providing suitable information retrieval systems for the life sciences (2). Results generated by text mining systems have the general advantage of being directly interpretable by the end users, the human domain experts, provided that direct links to the textual evidence are supported.
Existing online literature mining systems mostly focus on very particular biological aspects, such as the extraction of protein–protein interactions (3), protein–keyword co-mentions or gene regulation events, without directly integrating all these heterogeneous relation types into a single application. The relevance, not only of individual bio-entities but also of their interactions and regulation events, for developmental processes studied in model organisms such as the plant Arabidopsis thaliana has not been addressed previously using text mining approaches.

Arabidopsis thaliana, the first plant to be completely sequenced, is being used not only for experimental research, but also increasingly by systems biology and bioinformatics approaches to understand and model central biological processes (4). Databases like TAIR [The Arabidopsis Information Resource, (5)] or UniProt (6) provide plant biologists with valuable infrastructures of manually curated information, but only a few attempts have been made to implement literature mining applications for this model organism. The Dragon Plant Biology Explorer (DPBE) was an online text mining application for plant biology based on the integration and combination of collections of manually curated vocabularies compiled for several topics to facilitate a more targeted literature search (7). PubSearch constitutes another system for semi-automated retrieval of literature that can be used to curate articles in order to manually extract annotations for gene products with Gene Ontology terms (8). It is primarily based on simple term matching and does not explore machine learning techniques to provide more sophisticated retrieval capabilities. Another system primarily used for literature curation by model organism databases is Textpresso (9), now integrated at TAIR to facilitate better access to information relevant for Arabidopsis.
SYSTEM DESCRIPTION

PLAN2L is an online text mining application dedicated to improving the retrieval of knowledge by integrating and scoring information extracted from textual sources for various biological topics related to the description of interactome associations in Arabidopsis.

Figure 1
TECHNICAL DESCRIPTION OF THE TEXT MINING PIPELINE

A document retrieval pipeline was implemented that takes into account several sources of evidence for determining whether a given article is associated with A. thaliana: (i) external references derived from multiple databases providing annotations for Arabidopsis proteins; (ii) organism and taxonomic name tagging using dictionary lookup based on a species lexicon derived from the NCBI Taxonomy, which was automatically extended using a rule-based approach to account for typographical variants and abbreviations of species names; and (iii) keyword-based retrieval from PubMed and PubMed Central. Additionally, a full-text collection of Arabidopsis-related articles was constructed from a local repository of open access full-text articles as well as by using customized article collection tools. Plain text conversion was carried out through a combination of systems including pdftotext and PDFlib.

The detection of links between the literature and the proteins or genes of PLAN2L is based on the construction and lookup of a gene lexicon. This gene dictionary integrated A. thaliana gene names and symbols derived from multiple databases, including TAIR and SwissProt, and from a collection of gene and protein names identified by a machine-learning named entity recognition program (ABNER), as well as from a rule-based approach that considers morphological cues and name length to identify potential Arabidopsis gene symbols. The lexicon was then expanded using manually crafted rules.

To detect gene regulatory relations, we adapted an IE architecture relying on a pipeline of semantic/syntactic rules. We applied part-of-speech (POS) tagging to each word using a GENIA-trained version of TreeTagger (10). Some of the POS tags were automatically substituted with more semantically oriented labels (e.g. organisms, protein/gene names and activation verbs).
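The substitution of POS tags with semantic labels described above can be illustrated with a minimal sketch. It is not the PLAN2L implementation: the lexicon entries, the label names (`GENE`, `ORG`, `ACT_VERB`) and the function `relabel` are hypothetical, and the input is a made-up, already tagged sentence rather than real TreeTagger output.

```python
# Hypothetical mini-lexicons; the article does not list the actual entries.
GENE_LEXICON = {"genea", "geneb"}
ORGANISM_LEXICON = {"arabidopsis"}
ACTIVATION_VERBS = {"activates", "induces"}


def relabel(tagged_tokens):
    """Replace the POS tag of a token with a semantic label when the
    token is found in one of the lexicons; keep the POS tag otherwise."""
    relabelled = []
    for word, pos in tagged_tokens:
        key = word.lower()
        if key in GENE_LEXICON:
            relabelled.append((word, "GENE"))
        elif key in ORGANISM_LEXICON:
            relabelled.append((word, "ORG"))
        elif key in ACTIVATION_VERBS and pos.startswith("VB"):
            # Only verbs are promoted to the activation-verb label.
            relabelled.append((word, "ACT_VERB"))
        else:
            relabelled.append((word, pos))
    return relabelled


# Made-up example sentence with Penn-style POS tags.
tagged = [
    ("GeneA", "NN"),
    ("activates", "VBZ"),
    ("GeneB", "NN"),
    ("in", "IN"),
    ("Arabidopsis", "NNP"),
]
mixed = relabel(tagged)
```

The resulting sequence mixes syntactic tags (here `IN`) with semantic ones, which is the kind of input a downstream rule-based parser can match with patterns such as "GENE ACT_VERB GENE".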
The text with mixed syntactic and semantic tags was fed into a SCOL parser (11), which generated a tree-like structure by applying a modified CASS grammar. These rules constitute cascades of finite-state automata, and use patterns that combine both grammatical and biological-meaning features in the linguistic structure. We implemented extensions of the rules to handle frequent phrase coordination and prepositional anaphora.

The extraction of protein interaction evidence associations was addressed using a machine learning sentence classifier approach relying on manually selected interaction evidence sentences (12). The sentence classifier relies on a Support Vector Machine algorithm trained on a set of manually classified interaction evidence passages derived from a collection used at the second BioCreative challenge (12). Using a radial basis kernel function, it obtained a precision of 89.75 and a recall of 92.62 on a balanced test set.

For retrieving protein localization descriptions, we explored both the use of semantic–syntactic frames for extracting fine-grained associations between proteins and subcellular location mentions and a machine learning sentence classifier for retrieving protein localization description sentences in general. The initial step consisted of constructing a sub-cellular location dictionary that integrates location keywords and synonyms derived from SwissProt together with cellular component terms from Gene Ontology. Sentences mentioning location terms were manually revised to derive hand-crafted location frames. Additionally, a location sentence classifier was constructed using a collection of 2264 protein location description sentences.

A central component of PLAN2L is the scoring of each evidence sentence according to its relevance for complex temporal biological events (topics), at the cellular level (cell cycle) as well as at the level of developmental processes.
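For readers less familiar with the evaluation measures quoted above for the interaction-sentence classifier, the following sketch shows how precision and recall are computed from classification counts. The counts are invented for illustration, and the F-measure helper is an addition here: the article reports only precision and recall, which this sketch assumes to be percentages.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a binary classifier.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    A measure is reported as 0.0 when its denominator is zero.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Invented counts: 90 correct hits, 10 false alarms, 30 missed sentences.
p, r = precision_recall(true_pos=90, false_pos=10, false_neg=30)

# Harmonic mean of the two figures quoted in the text (about 91.16).
# This value is derived here and is not reported in the article.
f_reported = f_measure(89.75, 92.62)
```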
We therefore implemented a classifier for scoring cell-cycle relevant abstracts and document passages. The SVM text classifier was trained on a collection of cell-cycle relevant abstracts and non-relevant abstracts and then applied to a literature collection of abstracts and full-text articles mentioning A. thaliana genes. Additionally, four specific sentence classifiers were developed for the most relevant developmental processes in higher plants, namely (i) flowering, (ii) leaf development, (iii) root development and (iv) seed development/germination. The tool provides a comprehensive approach to assist in the selection and ranking of genes, proteins, documents and terms relevant to a specific biological process for this model organism. Additional details on the different modules, their characteristics and assessment are provided on the PLAN2L website.

FUNCTIONALITY AND USAGE

The PLAN2L interface handles user-provided plain text keywords, protein/gene names or symbols. Some of the components additionally allow searching with gene or protein identifiers. The currently supported identifiers include TAIR gene identifiers and UniProt primary accession numbers. We carried out a user survey to get feedback from biologists on the PLAN2L system. Aspects that were positively rated included its easy-to-use and intuitive query interface and the direct retrieval of evidence sentences for multiple topics. Aspects that were improved in response to user comments covered additional documentation on the system and the sentence scoring mechanism. PLAN2L supports six types of search strategies, each with its own query page, to avoid introducing unnecessary complexity through advanced search interfaces with complicated menu options. We will briefly describe each of these six search types and provide a case study to illustrate the type of results generated by PLAN2L.
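As a rough illustration of how the identifier types mentioned above could be told apart from free-text queries, the sketch below classifies a query string with two regular expressions. This is an assumption-laden example, not the PLAN2L code: the function name and return labels are hypothetical, the first pattern follows the usual Arabidopsis locus code layout (such as AT4G18960), and the second follows the accession number format documented by UniProt.

```python
import re

# Arabidopsis locus code: "AT", chromosome (1-5, C or M), "G", five digits.
AGI_LOCUS = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)

# UniProt accession number format as documented by UniProt.
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def classify_query(query):
    """Return a hypothetical label describing the kind of query."""
    text = query.strip()
    if AGI_LOCUS.match(text):
        return "tair_locus"
    if UNIPROT_ACCESSION.match(text.upper()):
        return "uniprot_accession"
    # Anything else is treated as a keyword, gene name or symbol.
    return "keyword_or_symbol"


kinds = [classify_query(q) for q in ("AT4G18960", "P17839", "AGAMOUS", "cell cycle")]
```

A pattern match only shows that a string is well formed; whether the identifier exists would still have to be checked against the underlying database.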
Case study: AGAMOUS and LEUNIG

As an example, to illustrate the kind of output generated by PLAN2L, we searched the system using AGAMOUS (TAIR locus AT4G18960). The basic search for this bio-entity returns a range of descriptive sentences, Figure 2a
Implementation, user testing and availability

PLAN2L is mainly written in Python and uses the Zope web application server (www.zope.org) to display the online results. Some of the protein normalization modules are implemented in C, and a collection of additional preprocessing and NLP components are written in Perl. The sentence classifier component relies on the SVMLight package. The relational database server, PostgreSQL, hosts part of the underlying text corpora and lexical resources. The Zope server runs on a HP 360 G5 Intel(R) Xeon(TM) CPU 3.40 GHz machine with 3 GB of RAM.

The initial version of this system (PLAN2L-aratreg) has been online since 10 January 2007, and has been improved based on the demands of the EU funded DIAMONDS consortium. Since 1 January 2009, the system has been tested by over 890 visitors. Individual user feedback and requests are being collected to improve the practical usefulness of PLAN2L for the plant science community. The PLAN2L system is available at: http://zope.bioinfo.cnio.es/plan2l. PLAN2L has been tested on the most common browsers (Firefox, Safari and Internet Explorer), accounting for 95.88% of the PLAN2L users. Online help, including documentation and a tutorial using prerun example cases, as well as additional details on the system evaluation and user feedback (FAQ), are provided together with the online application.

DISCUSSION AND CONCLUSION

We described a text mining system, PLAN2L, that enables exploration of literature information at different levels of granularity, from retrieval of gene description sentences derived from multiple documents, to qualified biological relations important to understand the sub-cellular context and both physical and regulatory interaction networks of bio-entities.
PLAN2L extracts biological information from both abstracts and full-text articles and integrates different language processing strategies, from simple co-occurrence to syntactic/semantic rule-based algorithms and supervised machine learning methods. PLAN2L is intended to be useful for general retrieval and topic-specific retrieval, as well as for finding association evidence for user-specified entities and for knowledge and hypothesis confirmation. A strategy similar to the one used for PLAN2L can easily be adapted to other model organisms as well as to specific biological topics with minimal additional manual data preparation.

FUNDING

The ENFIN (LSHG-CT-2005-518254), BIOSAPIENS (LSG-CT-2003-503265) and DIAMONDS (LSHG-CT-2004-512143) projects. Funding for open access charge: ENFIN (LSHG-CT-2005-518254).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

Many thanks to Florian Leitner and other users including Ernesto Ortiz, Cameron MacPherson, Nathan McCorkle, Colleen Doherty and Anais Baudot for useful feedback.

REFERENCES

1. Berardini TZ, Mundodi S, Reiser L, Huala E, Garcia-Hernandez M, Zhang P, Mueller LA, Yoon J, Doyle A, Lander G, et al. Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol. 2004;135:745–755.
2. Krallinger M, Valencia A, Hirschman L. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 2008;9(Suppl. 2):S8.
3. Hoffmann R, Valencia A. Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics. 2005;21(Suppl. 2):ii252–ii258.
4. Bevan M, Walsh S. The Arabidopsis genome: a foundation for plant research. Genome Res. 2005;15:1632–1642.
5. Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, et al. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 2008;36:D1009–D1014.
6. Schneider M, Bairoch A, Wu CH, Apweiler R. Plant protein annotation in the UniProt Knowledgebase. Plant Physiol. 2005;138:59–66.
7. Bajic VB, Veronika M, Veladandi PS, Meka A, Heng M.-W, Rajaraman K, Pan H, Swarup S. Dragon plant biology explorer. A text-mining tool for integrating associations between genetic and biochemical entities with genome annotation and biochemical terms lists. Plant Physiol. 2005;138:1914–1925.
8. Yoo D, Xu I, Berardini TZ, Yon Rhee S, Narayanasamy V, Twigger S. PubSearch and PubFetch: a simple management system for semiautomated retrieval and annotation of biological information from the literature. Curr. Protoc. Bioinformatics. 2006; Chapter 9: Unit 9.7.
9. Müller H.-M, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2004;2:e309.
10. Schmid H. Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing. Manchester, UK: 1994; pp. 44–49.
11. Abney S. Partial parsing via finite-state cascades. Nat. Lang. Engg. 1996;2:337–344.
12. Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A. Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol. 2008;9(Suppl. 2):S1.