PLoS Comput Biol. Dec 2009; 5(12): e1000597.
Published online Dec 24, 2009. doi:  10.1371/journal.pcbi.1000597
PMCID: PMC2791166

Biomedical Text Mining and Its Applications

Fran Lewitter, Editor

Introduction

This tutorial is intended for biologists and computational biologists interested in adding text mining tools to their bioinformatics toolbox. As an illustrative example, the tutorial examines the relationship between progressive multifocal leukoencephalopathy (PML) and antibodies. Recent cases of PML have been associated with the administration of some monoclonal antibodies such as efalizumab [1]. Those interested in a further introduction to text mining may also want to read other reviews [2]–[4].

Understanding large amounts of text with the aid of a computer is harder than simply equipping a computer with a grammar and a dictionary. A computer, like a human, needs certain specialized knowledge in order to understand text. The scientific field that is dedicated to training computers with the right knowledge for this task (among other tasks) is called natural language processing (NLP). Biomedical text mining (henceforth, text mining) is the subfield that deals with text that comes from biology, medicine, and chemistry (henceforth, biomedical text). Another popular name is BioNLP, which some practitioners use as synonymous with text mining.

Biomedical text is not a homogeneous realm [5]. Medical records are written differently from scientific articles, sequence annotations, or public health guidelines. Moreover, local dialects are not uncommon [6]. For example, medical centers develop their own jargons and laboratories create their idiosyncratic protein nomenclatures. This variability means, in practice, that text mining applications are tailored to specific types of text. In particular, for reasons of availability and cost, many are designed for scientific abstracts in English from Medline.

Main Concepts

Terms

A term is a name used in a specific domain, and a terminology is a collection of terms. Terms abound in biomedical text, where they constitute important building blocks. Some examples of terms are the names of cell types, proteins, medical devices, diseases, gene mutations, chemical names, and protein domains [7]. Due to their importance, text miners have worked to design algorithms that recognize terms (see examples in Figure 1). The task of recognizing terms is also called named entity recognition in the text mining literature, although this NLP task is broader and goes beyond recognition of terms. Although the concept of term is intuitive (or, perhaps, because it is intuitive), terms are hard to define precisely [8]. For example, the text “early progressive multifocal leukoencephalopathy” could possibly refer to any, or all, of these disease terms: “early progressive multifocal leukoencephalopathy,” “progressive multifocal leukoencephalopathy,” “multifocal leukoencephalopathy,” and “leukoencephalopathy.” To overcome such dilemmas, text miners ask experts to identify terms within collections of text such as sets of selected Medline abstracts. These annotations are then used to train a computer by example, so that the computer can emulate the knowledge experts deploy when they read biomedical text. This pedagogical method, “teaching by example,” is a common approach used in many text mining tasks and it is more generally called supervised training. (Alternatively, text miners create rules using expert knowledge.) Thus, text miners rely heavily on collections of text (corpora) that have been annotated by experts (see compilations of corpora: http://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html; http://compbio.uchsc.edu/ccp/corpora/obtaining.shtml). Before beginning a text mining task, it is advisable to limit the scope of the task to a corpus made of a set of documents around the topic of interest.
In our case, a PML corpus could comprise all the Medline abstracts that mention the term “progressive multifocal leukoencephalopathy,” because this is an unambiguous term. Another relevant corpus to consider could be the ImmunoTome [9], which is focused on immunology.
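The nested-term dilemma described above can be made concrete with a small sketch. It is an illustration only, not the method of any tool cited here: it assumes a hypothetical three-entry terminology, uses plain substring matching without word-boundary checks, and applies one common heuristic (keep the longest match) where a trained recognizer would rely on expert annotations.

```python
def find_terms(text, terminology):
    """Return every (start, end, term) span where a listed term occurs.

    Matching is case-insensitive plain substring search; word boundaries
    are not checked, which is a simplification.
    """
    lowered = text.lower()
    spans = []
    for term in terminology:
        needle = term.lower()
        start = lowered.find(needle)
        while start != -1:
            spans.append((start, start + len(needle), term))
            start = lowered.find(needle, start + 1)
    return sorted(spans)


def longest_matches(spans):
    """Drop any span that lies inside another, different span."""
    kept = []
    for span in spans:
        contained = any(
            other != span and other[0] <= span[0] and span[1] <= other[1]
            for other in spans
        )
        if not contained:
            kept.append(span)
    return kept


# Hypothetical terminology taken from the example in the text.
terminology = [
    "progressive multifocal leukoencephalopathy",
    "multifocal leukoencephalopathy",
    "leukoencephalopathy",
]
text = "A case of early progressive multifocal leukoencephalopathy."

all_spans = find_terms(text, terminology)  # three overlapping candidates
best = longest_matches(all_spans)          # only the longest one survives
```

The longest-match rule is a design choice, not a definition of what a term is: whether “early” belongs to the term is exactly the kind of decision that expert annotation is meant to settle.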

Figure 1
Examples of term recognition.

Text miners are interested in terminologies that have been built manually. These controlled terminologies have notable roles in biomedicine, for example, the HUGO gene nomenclature, the ICD disease classification, or the Gene Ontology. Many of these terminologies are more than just a flat list of terms. Some include term synonyms (thesauri) or relations between terms (taxonomies, ontologies). For text miners, their usefulness comes from their ability to link to information. Once a text is mapped to one of these terminologies, a bridge is opened between the text and other resources. This usefulness justifies efforts such as the National Library of Medicine's manual mapping of Medline abstracts to the Medical Subject Headings (MeSH) terminology. In our example, MeSH can be used to make the PML corpus more focused by restricting it only to abstracts with the MeSH term “leukoencephalopathy, progressive multifocal.” Controlled terminologies can be used to annotate results from experiments and databases [10]. Text miners attempt to make such mappings automatically. For example, a task called gene normalization consists in recognizing names of genes in text and mapping them to their corresponding gene identifiers (e.g., Entrez Gene ID). Thus, using gene normalization it is possible to identify all the abstracts in Medline that mention a given gene from Entrez Gene [11].
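As a rough illustration of gene normalization, the sketch below maps name variants found in text to identifiers through a synonym table. Everything in it is an assumption made for the example: the table is tiny, the identifiers (`GENE:0001`, `GENE:0002`) are placeholders rather than real Entrez Gene IDs, and real systems must also handle ambiguous names, which this sketch ignores.

```python
import re

# Hypothetical synonym table; the identifiers are placeholders,
# not real Entrez Gene IDs.
SYNONYMS = {
    "GENE:0001": ["IL-2", "IL2", "interleukin 2", "interleukin-2"],
    "GENE:0002": ["TNF", "TNF-alpha", "tumor necrosis factor"],
}


def normalize(name):
    """Lowercase and drop spaces and hyphens so spelling variants collapse."""
    return re.sub(r"[\s\-]+", "", name.lower())


def build_lookup(synonyms):
    """Map each normalized synonym to its identifier."""
    lookup = {}
    for gene_id, names in synonyms.items():
        for name in names:
            lookup[normalize(name)] = gene_id
    return lookup


def normalize_genes(text, lookup, max_words=3):
    """Return the identifiers whose synonyms occur in the text.

    Every run of 1..max_words consecutive words is tried as a candidate name.
    """
    words = re.findall(r"[A-Za-z0-9\-]+", text)
    found = set()
    for i in range(len(words)):
        for n in range(1, max_words + 1):
            candidate = normalize(" ".join(words[i:i + n]))
            if candidate in lookup:
                found.add(lookup[candidate])
    return found


LOOKUP = build_lookup(SYNONYMS)
ids = normalize_genes(
    "Tacrolimus reduces interleukin 2 and TNF-alpha production.", LOOKUP
)
```

Once text is reduced to identifiers in this way, abstracts can be joined with any other resource keyed on the same identifiers, which is the "bridge" described above.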

Because there are many controlled terminologies, some terminologies have been created to map between them. For example, the BioThesaurus [12] is a compilation of protein synonyms from several terminologies. The Unified Medical Language System (UMLS) [13],[14] is a grand compilation of more than 120 terminologies and close to 4 million terms. Despite UMLS's size, all controlled terminologies are incomplete, because new terms are created too quickly to keep them up to date. Furthermore, all have gaps and areas of emphasis that conflict with the needs of users.

Tools for Terms

Whatizit [15] is a tool that recognizes several types of terms. It can be accessed through a Web interface, Web services, or a streamed servlet. Abner [16] is a standalone application that recognizes five types of terms: protein, DNA, RNA, cell line, and cell type. More specialized term recognition has been used, for example, for databases such as LSAT [17] for alternative transcripts and PepBank [18] for peptides. Text miners have also used terminologies to enrich PubMed's search capabilities. Some recent search engines are semedico [19], novo|seek [20], and GoPubMed/GoGene [21],[22].

Relationships

After recognizing terms, the natural next step is to look for relationships between terms. The simplest method to identify relationships is using the co-occurrence assumption: terms that appear in the same texts tend to be related. For example, if a protein is mentioned often in the same abstracts as a disease, it is reasonable to hypothesize that the protein is involved in some aspect of the disease. The degree of co-occurrence can be quantified statistically to rank and eliminate statistically weak co-occurrences (see Box 1). An example using GoGene [22] can illustrate the use of simple co-occurrence, MeSH terms, and gene normalization. The query “leukoencephalopathy, progressive multifocal”[mh] in GoGene returns all the genes mentioned in Medline abstracts annotated with the MeSH term for PML. The genes that appear most often are likely to be related to PML. Those that appear disproportionately more often for PML than for other diseases are likely to be more specific to PML.

Box 1. The strength of a relationship. The confidence in a fact that comes from text can be qualified by the level of certainty of the assertion where the fact was found or by the strength of the evidence it points to [71]. Since facts do not stand alone, this confidence depends also on the fact's consistency with related facts [72]. In the case of co-occurrence of two terms $t_1$ and $t_2$, the simplest confidence metric is the count $c$ of texts that include both terms, $c(t_1, t_2)$ (for a PPI example, see [73]). This measure can be normalized by the possibility of random co-occurrences due to the sheer popularity of one or both terms. For example,

$$\frac{c(t_1, t_2)}{c(t_1)\, c(t_2)}$$

Pointwise mutual information (PMI) is similarly derived as

$$\mathrm{PMI}(t_1, t_2) = \log \frac{p(t_1, t_2)}{p(t_1)\, p(t_2)}$$

where $p(t)$, in this case, is $c(t)$ divided by the total number of texts. More generally, different measures can be drawn from the 2×2 contingency table that encompasses the counts of texts that include the two terms, $c(t_1, t_2)$, only one term ($c(t_1, \neg t_2)$ and $c(\neg t_1, t_2)$), and none, $c(\neg t_1, \neg t_2)$. Using this contingency table, MedGene [32] compared the merit of different statistical measures for gene-disease associations such as chi-square analysis, Fisher's exact probabilities, relative risk of gene, and relative risk of disease. More heuristic methods have been devised that use manually adjusted weights for different types of co-occurrence [36].
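The quantities in Box 1 are simple to compute once each text has been reduced to the set of terms it mentions. The following sketch assumes a toy corpus of eight invented "abstracts" and uses a base-2 logarithm for PMI; the choice of base is an assumption, since it only rescales the score.

```python
import math


def cooccurrence_stats(docs, t1, t2):
    """Return the 2x2 contingency table and PMI for two terms.

    docs is a list of sets, each holding the terms found in one text.
    PMI is None when the terms never co-occur (the logarithm is undefined).
    """
    n = len(docs)
    c1 = sum(1 for d in docs if t1 in d)
    c2 = sum(1 for d in docs if t2 in d)
    c12 = sum(1 for d in docs if t1 in d and t2 in d)
    table = {
        "both": c12,
        "only_t1": c1 - c12,
        "only_t2": c2 - c12,
        "neither": n - c1 - c2 + c12,
    }
    pmi = None
    if c12 > 0:
        # p(t1, t2) / (p(t1) * p(t2)), with p(t) = c(t) / n
        pmi = math.log2((c12 / n) / ((c1 / n) * (c2 / n)))
    return table, pmi


# Invented toy corpus: 8 texts, each reduced to its set of terms.
docs = [
    {"PML", "efalizumab"},
    {"PML", "efalizumab"},
    {"PML"},
    {"PML"},
    {"other"},
    {"other"},
    {"other"},
    {"other"},
]
table, pmi = cooccurrence_stats(docs, "PML", "efalizumab")
```

Here the two terms co-occur twice as often as expected by chance, so the PMI is 1 bit. The same table can be passed to a chi-square or Fisher's exact test when a significance estimate is wanted rather than a score.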

Better evidence than co-occurrence comes from relationships that are described explicitly [23]. For example, the sentence “We describe a PML in a 67-year-old woman with a destructive polyarthritis associated with anti-JO1 antibodies treated with corticosteroids” [24] describes an explicit link between PML and anti-JO1 antibodies. We can simplify this relationship into a triplet of two terms and a verb: PML is associated with anti-JO1 antibodies. To create the triplet, the verb can be identified with the aid of a part-of-speech (POS) tagger. An example of a POS tagger for biomedical text is MedPost [25]. This triplet representation is powerful due to its simplicity, but it omits crucial details from the original article, such as the fact that the evidence comes from a clinical case study.

A heavily studied area in text mining concerns the relationships known as protein-protein interactions (PPI). Using the triplet representation, PPI can be depicted as network graphs with the proteins as nodes and the verbs as edges (see Figure 2). When analyzing text-mined interaction networks, it is important to understand the information that underpins them. For example, interactions can be direct (physical) or indirect, depending on the verb (examples of direct verbs are to bind, to stabilize, to phosphorylate; examples of indirect verbs are to induce, to trigger, to block) [26]. The different nature of the protein interactions described in the literature reflects in part the experimental methodology employed and the nature of the interaction itself. A common way to capture the textual variations is by exhaustively identifying all the patterns that appear and writing a set of rules that capture them [27],[28]. For example, a simple pattern to capture phosphorylations might involve, sequentially, a kinase name, a form of the verb to phosphorylate, and a substrate name [29],[30].
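The simple phosphorylation pattern just described (kinase name, a form of the verb, substrate name) can be written as a regular expression. This is a minimal sketch, not the rule set of any cited system: the protein names are hypothetical placeholders assumed to come from an earlier term recognition step, and only the active voice is covered.

```python
import re

# Hypothetical protein names, assumed to be supplied by term recognition.
PROTEINS = ["KinA", "SubB", "SubC"]

_name = "|".join(re.escape(p) for p in PROTEINS)

# kinase name, then a form of "to phosphorylate", then substrate name
PATTERN = re.compile(
    rf"\b(?P<kinase>{_name})\s+phosphorylat(?:es|ed|e|ing)\s+(?P<substrate>{_name})\b"
)


def extract_phosphorylations(sentence):
    """Return (kinase, verb, substrate) triplets found in a sentence."""
    return [
        (m.group("kinase"), "phosphorylate", m.group("substrate"))
        for m in PATTERN.finditer(sentence)
    ]


active = extract_phosphorylations("We show that KinA phosphorylates SubB in vitro.")
passive = extract_phosphorylations("SubB is phosphorylated by KinA.")
```

The passive sentence yields nothing, which shows why rule-based systems need many patterns to capture the textual variations mentioned above.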

Figure 2
Example of text-mined PPI network.

Tools for Relationships

To see co-occurrence in action, try FACTA [31]. MedGene and BioGene [32],[33] use co-occurrence for gene prioritization. Gene prioritization tools such as Endeavour [34] and G2D [35] use text as well as other data sources. PolySearch [36] uses heuristic weighting of different co-occurrence measures and includes a detailed guide to implementation and vocabularies. Anni [37] uses textual profiles instead of co-occurrence to measure relationships between terms. For PPI, iHOP [38] is the most popular tool. RLIMS-P [30] uses linguistic patterns to detect the kinase, substrate, and phosphosite in a phosphorylation. E3Miner [39] detects ubiquitinations, including contextual information.

Discovery

Besides finding relationships, text miners are also interested in discovering relationships. Due to the size of the literature, scientists miss links between their work and other, related work. Swanson called these links “undiscovered public knowledge.” In a classic example, he found, by careful reading, 11 neglected links between magnesium and migraine [40]. One method to discover relationships is based on transitive inference [41]. Simply stated, if A is linked to B, and B is linked to C, then there is a chance that A is linked to C. PPI networks are, at the core, an example of transitive inference. Arrowsmith [42] is a basic discovery tool that compares two literature sets to find links between them. Applying Arrowsmith to the literature for PML and antibodies yields the immunomodulator tacrolimus, a calcineurin inhibitor, among the top hits. Tacrolimus affects the production of several proteins depicted in Figure 2, such as IL-2.
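Transitive inference can be sketched on a small graph of known links. The example below is an assumption-laden illustration, not Arrowsmith's algorithm: the links are abstract placeholders (A, B1, B2, C, D), links are treated as undirected, and candidates are ranked only by how many intermediate terms support them.

```python
from collections import defaultdict


def indirect_links(known_links, source):
    """Suggest terms linked to source only through an intermediate term.

    known_links is an iterable of (term, term) pairs, treated as undirected.
    Returns (candidate, intermediates) pairs, best supported first.
    """
    neighbors = defaultdict(set)
    for a, b in known_links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    direct = set(neighbors[source])
    candidates = defaultdict(set)
    for b in direct:
        for c in neighbors[b]:
            # Skip the source itself and anything already linked directly.
            if c != source and c not in direct:
                candidates[c].add(b)
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Placeholder links: A-B1, A-B2, B1-C, B2-C, B2-D
links = [("A", "B1"), ("A", "B2"), ("B1", "C"), ("B2", "C"), ("B2", "D")]
suggestions = indirect_links(links, "A")
```

In this toy graph, C is suggested ahead of D because two intermediates support it. A candidate is only a hypothesis: as the text says, there is a chance, not a guarantee, that A is linked to C.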

Quality

The most common measure of output quality in text mining is the F-measure, which is the harmonic mean of two other measures, precision and recall. These three measures can be described with the analogy of searching for needles in a haystack. After a manual search of a haystack, our hands end up full with valuable needles but also with some useless straws. Recall is based on the number of needles found. High recall means that we have found most of the needles for which we were looking. Precision, however, is based on the number of both needles and straws. High precision means that we have retrieved far more needles than straws. Both high precision and high recall are desirable, and a high F-measure reflects both because it is the harmonic mean. Optimizing the F-measure of a text mining application is often different from optimizing the accuracy, because there are usually few needles and large amounts of hay in the haystack. An application that identifies the whole haystack as being only hay is quite accurate but misses all the needles.
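The haystack analogy translates directly into a few lines of code. The sketch below assumes a hypothetical haystack of 1,000 items containing 10 needles, and adopts the convention (an assumption, since the ratio is undefined) that precision is 0 when nothing is retrieved.

```python
def evaluate(predicted, gold, universe_size):
    """Compute precision, recall, F-measure, and accuracy.

    predicted and gold are sets of item ids; universe_size is the total
    number of items (needles plus hay).
    """
    tp = len(predicted & gold)           # needles found
    fp = len(predicted - gold)           # straws picked up by mistake
    fn = len(gold - predicted)           # needles missed
    tn = universe_size - tp - fp - fn    # hay correctly left behind

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / universe_size
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}


needles = set(range(10))                  # 10 needles among 1,000 items
all_hay = evaluate(set(), needles, 1000)  # call everything hay
decent = evaluate(set(range(8)) | {100, 101}, needles, 1000)
```

Calling everything hay reaches an accuracy of 0.99 with an F-measure of 0, whereas retrieving 8 needles and 2 straws gives precision, recall, and F-measure of 0.8. This is the reason the F-measure, not accuracy, is the usual target.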

It is important to consider how an application has been evaluated before assessing its F-measure [43], and especially how realistic the evaluation was. The F-measure is not an absolute value. The larger a haystack is, the more difficult it is to find needles. In other words, a low F-measure might reflect a harder task, not a worse application. Moreover, text mining applications may perform differently on different types of text, and this may be reflected in lower F-measures than advertised. When the attainable F-measure is not high enough, one solution is to use text mining as a filter. A filter needs high recall, but only moderate precision, to reduce the amount of hay without affecting the needles. Filtering with text mining is used as a preliminary step in databases such as MINT [44], DIP [45], and BIND [46]. Filtering is followed by human curation, which involves the review and assessment of results to reduce hay and, hopefully, provide feedback to improve the filtering. The feedback loop between text mining and curation can have an incremental positive impact on output results [47].

Comprehensiveness

Doing comprehensive text mining means considering all sources of information, Medline and beyond. The abstract conveys an article's main findings, but many other pieces of information are elsewhere in the full text, figures, tables, supplementary information, references, databases, Web sites, and multimedia files. In particular, the full text is critical for information that rarely appears in abstracts, such as experimental measurements. A more comprehensive PML corpus would include full text articles. However, despite the surge in open access articles (see the Directory of Open Access Journals, www.doaj.org; [48]), the majority of published articles have access and processing restrictions. PubMed Central [49] is the main source of open access articles, and the specialized search engines BioText [50], Yale Image Finder [51], and Figurome [52] search PubMed Central figures and tables. A search for “progressive multifocal leukoencephalopathy” in the Yale Image Finder yields only one figure, while a search for “PML” yields a large number of hits, most of them not relevant because PML is an ambiguous acronym.

Text and DNA

Considering text as a sequence of symbols as informative as a protein's DNA sequence is the underlying premise of many text mining tools for bioinformatics. For example, the linguistic similarity between protein corpora (sets of texts built around proteins) correlates with the BLAST score between those same proteins [53]. Text that is used in articles or database annotations to describe a protein can be used for protein clustering and to predict structure [54], subcellular localization, and function [55]. For example, a protein corpus of a protein located in the nucleus uses a vocabulary that is somewhat different from a corpus built around a secreted protein. These vocabulary differences can be used to predict the subcellular localization of a protein of unknown location. One way to measure vocabulary differences is to represent the texts as vectors of word counts. The word counts can be normalized by the size of the text they come from and the vectors compared using, for example, Euclidean distance (for more, see [56]). To reduce vector dimensionality, some words can be grouped using a method called stemming. A simple example of stemming is converting plural nouns into singular form and verbs into infinitive form (a widely used stemming algorithm is the Porter stemmer [57]). Additional simplification can be achieved via tokenization, because some words can be separated into constitutive elements called tokens. In English, however, most words are a single token. An example of a word of two tokens is don't.
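The vector comparison described above can be sketched in a few functions. The example rests on several assumptions: the three "corpora" are invented one-line texts, the tokenizer keeps only runs of letters, and `crude_stem` is a deliberately naive stand-in for a real stemmer such as Porter's, handling only some plural nouns.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase runs of letters (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip a plural 's'; a naive stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word


def profile(text):
    """Word counts normalized by the size of the text."""
    counts = Counter(crude_stem(w) for w in tokenize(text))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def euclidean(p, q):
    """Euclidean distance between two sparse word-frequency vectors."""
    words = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))


# Invented miniature "protein corpora".
nuclear = profile("nucleus transcription factors bind chromatin in the nucleus")
secreted = profile("secreted hormone released into serum")
unknown = profile("transcription factor in the nucleus")

closer_to_nuclear = euclidean(unknown, nuclear) < euclidean(unknown, secreted)
```

With these toy texts the unknown protein's vocabulary is nearer to the nuclear corpus, so a nearest-corpus rule would predict a nuclear location. Real predictors use far larger corpora and usually weight words so that common ones such as "the" do not dominate the distance.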

Text mining applications for bioinformatics [58] include subcellular localization prediction such as Sherloc and Epiloc [59],[60] and protein clustering such as TXTGate [61]. Thus, text mining tools can be used for annotating biological databases in the same fashion other bioinformatics tools are used.

More Tools

An extensive list of text mining applications is maintained at http://zope.bioinfo.cnio.es/bionlp_tools/ [62]. A growing number of tools are being developed under a standard framework called UIMA, which comprises NLP as well as BioNLP tools [63].

Conclusion

Text mining tools are increasingly accessible to biologists and computational biologists, and they can often be applied to answer scientific questions in combination with other bioinformatics tools. Getting acquainted with them is a first step towards grasping the possibilities of text mining and towards venturing into the algorithms described in the literature. One way to get started on this path is by looking at examples such as [64]–[67].

Acknowledgments

I would like to thank Rohitha P. SriRamaratnam for comments on the manuscript.

Footnotes

The author has declared that no competing interests exist.

The author received no specific funding for this work.

References

1. Sobell JM, Weinberg JM. Patient fatalities potentially associated with efalizumab use. J Drugs Dermatol. 2009;8:215. [PubMed]
2. Cohen KB, Hunter L. Getting started in text mining. PLoS Comput Biol. 2008;4:e20. doi: 10.1371/journal.pcbi.0040020. [PMC free article] [PubMed]
3. Rzhetsky A, Seringhaus M, Gerstein MB. Getting started in text mining: part two. PLoS Comput Biol. 2009;5:e1000411. doi: 10.1371/journal.pcbi.1000411. [PMC free article] [PubMed]
4. Rzhetsky A, Seringhaus M, Gerstein M. Seeking a new biology through text mining. Cell. 2008;134:9–13. [PMC free article] [PubMed]
5. Friedman C, Kra P, Rzhetsky A. Two biomedical sublanguages: a description based on the theories of Zellig Harris. J Biomed Inform. 2002;35:222–235. [PubMed]
6. Netzel R, Perez-Iratxeta C, Bork P, Andrade MA. The way we write. EMBO Rep. 2003;4:446–451. [PMC free article] [PubMed]
7. Krauthammer M, Nenadic G. Term identification in the biomedical literature. J Biomed Inform. 2004;37:512–526. [PubMed]
8. Tanabe L, Xie N, Thom LH, Matten W, Wilbur WJ. GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics. 2005;6(Suppl 1):S3. [PMC free article] [PubMed]
9. Kabiljo R, Shepherd AJ. Protein name tagging in the immunological domain. Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008) 2008:141–144.
10. Lu X, Zhai C, Gopalakrishnan V, Buchanan BG. Automatic annotation of protein motif function with Gene Ontology terms. BMC Bioinformatics. 2004;5:122. [PMC free article] [PubMed]
11. Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, et al. Overview of BioCreative II gene normalization. Genome Biol. 2008;9(Suppl 2):S3. [PMC free article] [PubMed]
12. Liu H, Hu ZZ, Zhang J, Wu C. BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics. 2006;22:103–105. Available: http://pir.georgetown.edu/pirwww/iprolink/biothesaurus.shtml. [PubMed]
13. Bangalore A, Thorn KE, Tilley C, Peters L. The UMLS knowledge source server: an object model for delivering UMLS data. AMIA Annu Symp Proc. 2003:51–55. Available: http://www.nlm.nih.gov/research/umls/ [PMC free article] [PubMed]
14. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. 2001:17–21. Available: http://mmtx.nlm.nih.gov/ [PMC free article] [PubMed]
15. Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A. Text processing through web services: calling Whatizit. Bioinformatics. 2008;24:296–298. Available: http://www.ebi.ac.uk/webservices/whatizit/info.jsf. [PubMed]
16. Settles B. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005;21:3191–3192. Available: http://pages.cs.wisc.edu/~bsettles/abner/ [PubMed]
17. Shah PK, Bork P. LSAT: learning about alternative transcripts in MEDLINE. Bioinformatics. 2006;22:857–865. Available: http://www.bork.embl.de/LSAT. [PubMed]
18. Shtatland T, Guettler D, Kossodo M, Pivovarov M, Weissleder R. PepBank–a database of peptides based on sequence text mining and public peptide data sources. BMC Bioinformatics. 2007;8:280. Available: http://pepbank.mgh.harvard.edu/ [PMC free article] [PubMed]
19. Wermter J, Tomanek K, Hahn U. High-performance gene name normalization with GeNo. Bioinformatics. 2009;25:815–821. Available: http://www.semedico.org/ [PubMed]
20. Alonso-Allende R. Accelerating searches of research grants and scientific literature with novo|seek. 2009. Nat Methods 6. Advertising feature. Available: http://www.novoseek.com/
21. Doms A, Schroeder M. GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res. 2005;33:W783–W786. Available: http://www.gopubmed.com. [PMC free article] [PubMed]
22. Plake C, Royer L, Winnenburg R, Hakenberg J, Schroeder M. GoGene: gene annotation in the fast lane. Nucleic Acids Res. 2009;37(Web Server issue):W300–W304. Available: http://www.gopubmed.org/gogene/ [PMC free article] [PubMed]
23. Shatkay H, Pan F, Rzhetsky A, Wilbur WJ. Multi-dimensional classification of biomedical text: toward automated, practical provision of high-utility text to diverse users. Bioinformatics. 2008;24:2086–2093. [PMC free article] [PubMed]
24. Viallard JF, Lazaro E, Ellie E, Eimer S, Camou F, et al. Improvement of progressive multifocal leukoencephalopathy after cidofovir therapy in a patient with a destructive polyarthritis. Infection. 2007;35:33–36. [PubMed]
25. Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics. 2004;20:2320–2321. Available: http://www.ncbi.nlm.nih.gov/staff/lsmith/MedPost.html. [PubMed]
26. Santos C, Eggle D, States DJ. Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics. 2005;21:1653–1658. [PubMed]
27. Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics. 2001;17(Suppl 1):S74–S82. [PubMed]
28. Blaschke C, Valencia A. The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform. 2001;12:123–134. [PubMed]
29. Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, Wu CH. Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics. 2005;21:2759–2765. [PubMed]
30. Yuan X, Hu ZZ, Wu HT, Torii M, Narayanaswamy M, et al. An online literature mining tool for protein phosphorylation. Bioinformatics. 2006;22:1668–1669. Available: http://pir.georgetown.edu/pirwww/iprolink/rlimsp.shtml. [PubMed]
31. Tsuruoka Y, Tsujii J, Ananiadou S. FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics. 2008;24:2559–2560. Available: http://text0.mib.man.ac.uk/software/facta/ [PMC free article] [PubMed]
32. Hu Y, Hines LM, Weng H, Zuo D, Rivera M, et al. Analysis of genomic and proteomic data using advanced literature mining. J Proteome Res. 2003;2:405–412. Available: http://medgene.med.harvard.edu/MEDGENE/ [PubMed]
33. Rolfs A, Hu Y, Ebert L, Hoffmann D, Zuo D, et al. A biomedically enriched collection of 7000 human ORF clones. PLoS ONE. 2008;3:e1528. Available: http://biogene.med.harvard.edu/BIOGENE/ [PMC free article] [PubMed]
34. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, et al. Gene prioritization through genomic data fusion. Nat Biotechnol. 2006;24:537–544. Available: http://homes.esat.kuleuven.be/~bioiuser/endeavour/endeavour.php. [PubMed]
35. Perez-Iratxeta C, Wjst M, Bork P, Andrade MA. G2D: a tool for mining genes associated with disease. BMC Genet. 2005;6:45. [PMC free article] [PubMed]
36. Cheng D, Knox C, Young N, Stothard P, Damaraju S, et al. PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res. 2008;36:W399–W405. Available: http://wishart.biology.ualberta.ca/polysearch/index.htm. [PMC free article] [PubMed]
37. Jelier R, Schuemie MJ, Veldhoven A, Dorssers LC, Jenster G, et al. Anni 2.0: a multipurpose text-mining tool for the life sciences. Genome Biol. 2008;9:R96. Available: http://www.biosemantics.org/index.php?page=anni-2-0. [PMC free article] [PubMed]
38. Hoffmann R, Valencia A. A gene network for navigating the literature. Nat Genet. 2004;36:664. Available: http://www.ihop-net.org/ [PubMed]
39. Lee H, Yi GS, Park JC. E3Miner: a text mining tool for ubiquitin-protein ligases. Nucleic Acids Res. 2008;36:W416–W422. Available: http://e3miner.biopathway.org. [PMC free article] [PubMed]
40. Swanson DR. Migraine and magnesium: eleven neglected connections. Perspect Biol Med. 1988;31:526–557. [PubMed]
41. Weeber M, Kors JA, Mons B. Online tools to support literature-based discovery in the life sciences. Brief Bioinform. 2005;6:277–286. [PubMed]
42. Smalheiser NR, Torvik VI, Zhou W. Arrowsmith two-node search interface: a tutorial on finding meaningful links between two disparate sets of articles in MEDLINE. Comput Meth Program Biomed. 2009;94:190–197. Available: http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/start.cgi. [PMC free article] [PubMed]
43. Caporaso JG, Deshpande N, Fink JL, Bourne PE, Cohen KB, et al. Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. Pac Symp Biocomput. 2008:640–651. [PMC free article] [PubMed]
44. Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, et al. MINT: a Molecular INTeraction database. FEBS Lett. 2002;513:135–140. [PubMed]
45. Marcotte EM, Xenarios I, Eisenberg D. Mining literature for protein-protein interactions. Bioinformatics. 2001;17:359–363. [PubMed]
46. Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, et al. PreBIND and Textomy–mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics. 2003;4:11. [PMC free article] [PubMed]
47. Rodriguez-Esteban R, Iossifov I, Rzhetsky A. Imitating manual curation of text-mined facts in biomedicine. PLoS Comput Biol. 2006;2:e118. doi: 10.1371/journal.pcbi.0020118. [PMC free article] [PubMed]
48. Wadman M. Open-access policy flourishes at NIH. Nature. 2009;458:690–691. [PubMed]
49. Vastag B. NIH launches PubMed Central. J Natl Cancer Inst. 2000;92:374. Available: http://www.ncbi.nlm.nih.gov/pmc/ [PubMed]
50. Hearst MA, Divoli A, Guturu H, Ksikes A, Nakov P, et al. BioText Search Engine: beyond abstract search. Bioinformatics. 2007;23:2196–2197. Available: http://biosearch.berkeley.edu/ [PubMed]
51. Xu S, McCusker J, Krauthammer M. Yale Image Finder (YIF): a new search engine for retrieving biomedical images. Bioinformatics. 2008;24:1968–1970. Available: http://krauthammerlab.med.yale.edu/imagefinder/ [PMC free article] [PubMed]
52. Rodriguez-Esteban R, Iossifov I. Figure mining for biomedical research. Bioinformatics. 2009;25:2082–2084. [PubMed]
53. Yandell MD, Majoros WH. Genomics and natural language processing. Nat Rev Genet. 2002;3:601–610. [PubMed]
54. Koussounadis A, Redfern OC, Jones DT. Improving classification in protein structure databases using text mining. BMC Bioinformatics. 2009;10:129. [PMC free article] [PubMed]
55. Pandey G, Kumar V, Steinbach M. Computational approaches for protein function prediction: a survey. 2006. Technical Report 06-028, Department of Computer Science and Engineering, University of Minnesota, Twin Cities.
56. Manning CD, Schutze H. Foundations of Statistical Natural Language Processing. MIT Press; 1999.
57. Van Rijsbergen CJ, Robertson SE, Porter MF. New models in probabilistic information retrieval. 1980. Tech. Rep. 5587. British Library. Available: http://tartarus.org/~martin/PorterStemmer/
58. Krallinger M, Valencia A. Text-mining and information-retrieval services for molecular biology. Genome Biol. 2005;6:224. [PMC free article] [PubMed]
59. Shatkay H, Höglund A, Brady S, Blum T, Dönnes P, et al. SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics. 2007;23:1410–1417. Available: http://www-bs.informatik.uni-tuebingen.de/Services/SherLoc2/ [PubMed]
60. Brady S, Shatkay H. EpiLoc: a (working) text-based system for predicting protein subcellular location. Pac Symp Biocomput. 2008:604–615. Available: http://epiloc.cs.queensu.ca/ [PubMed]
61. Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, et al. TXTGate: profiling gene groups with text-based information. Genome Biol. 2004;5:R43. Available: http://tomcat.esat.kuleuven.be/txtgate/ [PMC free article] [PubMed]
62. Krallinger M, Hirschman L, Valencia A. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 2008;9:S8. Available: http://zope.bioinfo.cnio.es/bionlp_tools/ [PMC free article] [PubMed]
63. Kano Y, Baumgartner WA, Jr, McCrohon L, Ananiadou S, Cohen KB, et al. U-Compare: share and compare text mining tools with UIMA. Bioinformatics. 2009;25:1997–1998. Available: http://u-compare.org/ [PMC free article] [PubMed]
64. Ramialison M, Bajoghli B, Aghaallaei N, Ettwiller L, Gaudan S, et al. Rapid identification of PAX2/5/8 direct downstream targets in the otic vesicle by combinatorial use of bioinformatics tools. Genome Biol. 2008;9:R145. [PMC free article] [PubMed]
65. Natarajan J, Berrar D, Dubitzky W, Hack C, Zhang Y, et al. Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line. BMC Bioinformatics. 2006;7:373. [PMC free article] [PubMed]
66. Leach SM, Tipney H, Feng W, Baumgartner WA, Kasliwal P, et al. Biomedical discovery acceleration, with applications to craniofacial development. PLoS Comput Biol. 2009;5:e1000215. doi: 10.1371/journal.pcbi.1000215. [PMC free article] [PubMed]
67. Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P. Drug target identification using side-effect similarity. Science. 2008;321:263–266. [PubMed]
68. Leitner F, Krallinger M, Rodriguez-Penagos C, Hakenberg J, Plake C, et al. Introducing meta-services for biomedical information extraction. Genome Biol. 2008;9(Suppl 2):S6. Available: http://bcms.bioinfo.cnio.es/ [PMC free article] [PubMed]
69. Fernández JM, Hoffmann R, Valencia A. iHOP web services. Nucleic Acids Res. 2007;35(Web Server issue):W21–W26. [PMC free article] [PubMed]
70. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research. 2003;13:2498–2504. Available: http://www.cytoscape.org/ [PMC free article] [PubMed]
71. Wilbur WJ, Rzhetsky A, Shatkay H. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics. 2006;7:356. [PMC free article] [PubMed]
72. Rzhetsky A, Zheng T, Weinreb C. Self-correcting maps of molecular pathways. PLoS One. 2006;1:e61. doi: 10.1371/journal.pone.0000061. [PMC free article] [PubMed]
73. Jenssen TK, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001;28:21–28. [PubMed]

Articles from PLoS Computational Biology are provided here courtesy of Public Library of Science