NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Madame Curie Bioscience Database [Internet]. Austin (TX): Landes Bioscience; 2000-.

Extracting Information for Meaningful Function Inference Through Text-Mining

* Corresponding Author: Knowledge Extraction Lab, Institute for Infocomm Research, Singapore, Singapore. Email:

One of the emerging technologies in computational biology is text-mining, which includes natural language processing. This technology enables extraction of parts of relevant biological knowledge from a large volume of scientific documents in an automated fashion. We present several systems which cover different facets of text-mining biological information with applications in transcription control, metabolic pathways, and bacterial cross-species comparison. We demonstrate how this technology can efficiently support biologists and medical scientists to infer function of biological entities and save them a lot of time, paving the way for more focused and detailed follow-up research.


Text-mining of biomedical literature has received increased attention in the past several years.1-6 There are several reasons:

  1. a huge volume of scientific documents available over the internet to an average user;
  2. inability of an average user to keep track of all relevant documents in a specific domain of interest;
  3. inability of humans to keep track of associations usually contained in, or implied by, scientific texts; these associations could be either explicitly stated, such as ‘interaction of A and B’, or they need not necessarily be explicitly spelled out in a single sentence;
  4. inability of humans to simultaneously deal with a large volume of terms and their cross-referencing;
  5. necessity to search a number of different documents (or sometimes resources) to extract a set of relevant information;
  6. inability of a single user to acquire the required information in a relatively short (acceptable) time.

As an illustration, the PubMed repository currently contains over 14 million indexed documents.1 PubMed searches frequently return several hundred or more documents. Studying such large document sets is not an easy task for a single user. If the analysis has to be repeated several times with different selections of documents, the task is usually not feasible.

Text-mining is seen as an interesting and powerful supporting technology to complement research in biology and medicine. A number of text-mining systems which tackle different problems aimed at supporting biological and medical research, and which focus on different aspects of genomics, proteomics, or relations to diseases, have been reported.7-19

Computational biology produces answers that form the basis for, and lead to better designs of, further experimentation. Among the various computational biology approaches, text-mining systems provide a unique front where a large body of knowledge produced by experimental biologists can be efficiently screened using “vocabularies”, or standard terms adopted and used widely by biologists. Hence, such systems analyze the existing knowledge and uncover potential associations among biological entities or phenomena that can lead to further experimentation. In effect, text-mining-based approaches allow biologists to focus on pieces of information that were reported independently and thus do not lend themselves to the establishment of readily recognizable associations or correlations. Many such associations in biology go unnoticed until more directed studies are done to address them. Text-mining approaches, therefore, have the inherent capacity to help speed up the rate of biological discovery.

In this chapter, we present several text-mining systems developed in our Knowledge Extraction Lab at the Institute for Infocomm Research, Singapore, two of which are the result of an ongoing collaboration with the Department of Biological Sciences, National University of Singapore. We show how these systems can assist an average (nonexpert) user to better understand specific problems in biology and bring them closer to answers about the functions of biological entities, inferred on the basis of an in silico method. Before we present these systems, we define the problem we intend to deal with and describe some of the general features that a text-mining system should provide to end-users.

Scope and Nature of Text-Mining in Biomedical Domain

By automated knowledge extraction, we understand an automated extraction of names of entities, such as genes, gene products, metabolites, pathways, etc., which appear in biomedical and molecular biology literature, as well as the relationships between these entities. The basic relation between two entities is characterized by the co-occurrence of their names in the same document, or in a specific segment of the document. However, the actual relation between these entities is not easy to characterize by the computer program. It is customary to leave it to the user to assess the actual nature of such relations, based on the associated documents. To the best of our knowledge, very few text-mining systems exist which can accurately extract such relations.

Characteristics of Text-mining Systems

There are several basic features that text-mining systems should provide. These systems should:

  1. be easy to use;
  2. be interactive;
  3. allow several ways of submitting data;
  4. allow the user to select categories of terms to be used in the analysis;
  5. provide suitable interactive summary reports;
  6. show association maps in suitable graphical format;
  7. preferably have built-in intelligence to filter out irrelevant documents;
  8. preferably be able to extract a large volume of useful information in a reasonable timeframe.

While, in principle, any free-text document can be analyzed, for the purposes of discussion here we will assume that documents are abstracts of scientific articles, such as those contained in the PubMed repository.1 Generally speaking, text analysis can be conducted at three levels: the ‘Abstract level’, the ‘Sentence level’, and the ‘Relation level’. At the ‘Abstract level’, the system analyzes the whole abstract to determine whether or not it contains relations between the biological entities of interest. At the ‘Sentence level’, the system assesses whether or not the abstract contains sentences that explicitly claim relations between the entities; here, the individual sentences are analyzed as a whole. Finally, at the ‘Relation level’, the system attempts to extract the specific entities, and the relations they are subject to, from the sentences that are assessed to contain such relations.
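
The three levels can be illustrated with a minimal Python sketch. It is not the method of any system described in this chapter: the entity list, the relation words, the naive sentence splitter, and the decision criteria (simple co-occurrence counts) are all assumptions chosen only to make the distinction between the levels concrete.

```python
import re

# Hypothetical toy vocabularies; a real system would use large curated dictionaries.
ENTITIES = {"NF-kappaB", "IkappaB", "c-Jun"}
RELATION_WORDS = {"interacts", "binds", "inhibits"}


def tokenize(text):
    # Keep hyphenated names such as "NF-kappaB" as single tokens.
    return re.findall(r"[\w-]+", text)


def split_sentences(abstract):
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]


def entities_in(text):
    # Distinct entity names in order of first appearance.
    found = []
    for token in tokenize(text):
        if token in ENTITIES and token not in found:
            found.append(token)
    return found


def abstract_level(abstract):
    # Crude proxy: flag the abstract if it mentions two or more entities.
    return len(entities_in(abstract)) >= 2


def sentence_level(abstract):
    # Sentences that mention two entities together with a relation word.
    hits = []
    for sentence in split_sentences(abstract):
        has_relation = any(t in RELATION_WORDS for t in tokenize(sentence))
        if len(entities_in(sentence)) >= 2 and has_relation:
            hits.append(sentence)
    return hits


def relation_level(abstract):
    # (entity, relation word, entity) triples from the flagged sentences,
    # listed in order of appearance (no direction of the relation is implied).
    triples = []
    for sentence in sentence_level(abstract):
        first, second = entities_in(sentence)[:2]
        relation = next(t for t in tokenize(sentence) if t in RELATION_WORDS)
        triples.append((first, relation, second))
    return triples


text = "NF-kappaB is studied here. We show that IkappaB inhibits NF-kappaB in these cells."
print(abstract_level(text))
print(sentence_level(text))
print(relation_level(text))
```

Each level reuses the output of the coarser one, which mirrors the idea that relation extraction is attempted only on sentences already judged to contain a relation.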

The systems, which we describe here, possess different combinations of these characteristics.

Focus of Our Text-Mining Systems

Regulatory systems in different organisms perform key functions of synchronizing events in the organism at different hierarchical levels. Networks of genes, proteins and metabolites activate and operate under different intra-cellular and extra-cellular conditions in order to provide proper responses of the organism to various stimulatory signals. Understanding the cause-consequence relations between entities in such complex systems and their relations to particular pathways can help us to understand how the organism functions. This can also help us to find out how to control the behavior of such regulatory networks and ultimately to develop more efficient drugs or reduce the problems of genetic-based diseases. The text-mining systems that we have developed aim at helping individual researchers to elucidate and partially reconstruct segments of such networks. Our text-mining systems focus on three general domains: transcription control, metabolic networks and gene networks. These are explained in more detail below.

Supporting Text-Mining Systems for Gene Regulatory Networks Reconstruction

Transcription control is the key mechanism for activation of genes and gene groups under different cellular conditions. Transcriptional regulatory networks20 provide information necessary to study different modalities of gene activation, such as tissue-specificity, timing, and rate of gene transcription. To infer parts of these control networks, we need to know the relations between the different transcription factors (TFs) involved in the process of transcription initiation. For example, let us assume that hypothetical TFs A and B are the products of genes GA and GB, respectively, and that a gene G requires a complex formed by A and B to bind to its promoter in order for transcription of G to be activated in a particular tissue. Then, if there is no A/B complex in the nucleus of the target cells, gene G will not be activated. The necessary condition obviously is that both genes GA and GB produce A and B in sufficient quantities so that the A/B complex can be generated. Thus, activation of G requires activity of GA and GB. Also, for GA and GB to be activated, another set of TFs would be required, and so on. In this way, one can reconstruct, step by step, parts of the gene networks which induce activation of gene G. However, we also notice that, because A and B interact, a synchronous action of GA and GB would be expected, meaning that these two genes should be able to coexpress under specific conditions in some tissues. This, in turn, suggests that their promoters should share some degree of similarity in terms of promoter context (the type of TF binding sites, their ordering and, partly, their spacing).21 Thus, analysis of the promoter of one of these genes, say GA, may reveal part of the information about the promoter of GB.

As can be seen in this hypothetical, but very common, case, information about the relation between A and B (they form a complex A/B) and the information that A/B binds to the promoter of G provide many additional clues about what to look for in reconstructing the gene network that controls activation of G in the target tissue. Text-mining tools can assist us tremendously in such tasks.
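
The step-by-step reconstruction in this hypothetical case can be sketched as a walk over two lookup tables. The entries for G, A, B, GA, and GB follow the example above; the TFs C and D and the genes GC and GD are invented here purely to show a second step, and the breadth-first traversal is one possible choice rather than a prescribed procedure.

```python
from collections import deque

# Gene -> TFs that must be present at its promoter (C and D are invented).
REQUIRES = {"G": ["A", "B"], "GA": ["C"], "GB": ["C", "D"]}
# TF -> gene that encodes it (GC and GD are invented).
ENCODED_BY = {"A": "GA", "B": "GB", "C": "GC", "D": "GD"}


def upstream_genes(target):
    """Breadth-first walk from a target gene to the genes whose products it needs."""
    order = []
    seen = {target}
    queue = deque([target])
    while queue:
        gene = queue.popleft()
        for tf in REQUIRES.get(gene, []):
            source = ENCODED_BY.get(tf)
            if source is not None and source not in seen:
                seen.add(source)
                order.append(source)
                queue.append(source)
    return order


print(upstream_genes("G"))
```

In practice the two tables would be filled from relations extracted from the literature, which is where text-mining enters the picture.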

Relations between TFs are thus one of the key sources of information for reconstruction of parts of gene transcriptional regulatory networks. The TRANSCompel database22 is a resource which contains several hundred experimentally verified relations between TFs collected manually from the literature. These relations need not always be in the form of interactions. For example, it is possible to have information of the form: ‘if both present, A and B affect transcription of G’. Such cause-consequence relations may show a synergistic effect on G when they enhance transcription of G, or an antagonistic effect when they negatively impact it. Unfortunately, the content of the TRANSCompel database represents only a fraction of the documented TF relations. It is a great challenge to collect this information and make it available to researchers in the gene regulatory networks field. One of our systems, Dragon TF Relation Extractor, allows such direct extraction of actual relations.

Dragon TF Relation Extractor (DTFRE)

This section describes DTFRE, a system that extracts the actual TF names and the type of relation(s) between them. The system is available for public and nonprofit use. DTFRE was developed on a large, manually cleaned corpus of data. Over 8 months, five trained biologists and a chemist read, analyzed, and classified more than 3000 PubMed abstracts related to transcription control in eukaryotes. Based on this information, we generated a database of classified sentences about TF relations. This database (TFRD) contains 5244 sentences. Of these, 2402 sentences were classified as positive, i.e., as containing explicit statements about TF relations; 1766 sentences were classified as negative, and 1076 were classified as ambiguous. To the best of our knowledge, TFRD represents the largest manually cleaned dataset used to date for developing specialized text-mining systems for extracting TF relations from biomedical literature.

The system allows the user to submit data and to obtain a report which lists individual TFs and the relations between them extracted from the supplied data. The data to be submitted are simply PubMed abstracts obtained by searching the PubMed database with the Entrez system and saved or copied in text form. Generated reports are interactive. To simplify the analysis of extracted relations, we provide the actual sentence from which each relation is extracted, as well as a link to the original PubMed document and to a version of that document with the crucial words and sentences marked. Users must provide an e-mail address to which a link to their result files will be sent. In Figure 1, we give a snapshot of a report generated from the analysis of PubMed documents ID = 14518567 and ID = 12709435, while in Figure 2, we show the first document with colored segments, as explained above.

Figure 1. A snapshot of the report page generated in the analysis of two PubMed documents. The first and third columns show the names of TFs, the second column shows the relation word, and the fourth column gives a link to the PubMed document.

Figure 2. An example of a colored PubMed document with identified TF names highlighted in red, with marked sentences from which TF relations are extracted, and with relation words highlighted in blue. The list of identified TF names is given at the bottom of the page.

DTFRE is a rule-based system. The system first tags TF names in the input sentences using a prebuilt TF dictionary. The sentences are then matched against the rules in a rulebase. For every match, the slots containing the TFs and the relation are extracted and presented to the user.
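
The dictionary-based tagging step can be sketched as follows. This is an illustration under stated assumptions, not the DTFRE implementation: the four dictionary entries, the `<TF>` markup, and the word-boundary convention are all hypothetical choices.

```python
import re

# Hypothetical dictionary: surface form (including synonyms) -> canonical TF name.
TF_DICTIONARY = {
    "NF-kappaB": "NF-kappaB",
    "NF-kB": "NF-kappaB",
    "c-Jun": "c-Jun",
    "AP-1": "AP-1",
}


def tag_tfs(sentence):
    """Wrap dictionary hits in <TF>...</TF> and return the canonical names found."""
    # Try longer names first so that a short name never pre-empts a longer one.
    names = sorted(TF_DICTIONARY, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # A hit must not be glued to other word characters or hyphens on either side.
    pattern = re.compile(r"(?<![\w-])(" + alternatives + r")(?![\w-])")

    found = []

    def mark(match):
        found.append(TF_DICTIONARY[match.group(1)])
        return "<TF>" + match.group(1) + "</TF>"

    return pattern.sub(mark, sentence), found


tagged, tfs = tag_tfs("c-Jun interacts with NF-kB.")
print(tagged)
print(tfs)
```

Mapping synonyms to a canonical name at tagging time means that later rule matching only has to deal with a single TF slot type.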

Rulebase Construction

A crucial component of the system is the rulebase that captures the knowledge about the TF-TF relation patterns in the sentences. Construction of the rulebase is a nontrivial task. The traditional method is to hand-code the rules with the help of experts.23-25 However, hand-coding is a tedious task and is error-prone, especially when there are a lot of abstracts to be analyzed. Automatically learning the rules from abstracts is an attractive alternative. Learning extraction rules could help automate the rulebase construction or at least ease the hand-coding process, e.g., by letting the learning method generate seed rules that could be manually refined.

Rule learning from text is an active topic investigated by the Information Extraction (IE) community.26 Though a number of rule learning systems have been proposed,27 directly applying them to extract biological interactions has produced only moderate results.28 The reason is that the complexities in biomedical literature demand learning algorithms customized for the biomedical domain. We have developed one such learning algorithm for constructing the rulebase of our system.

Rule Representation

Our rule learning algorithm uses a disjunctive rule representation. An example of a rule for the “interact” relation is shown in Figure 4. As the figure shows, the rule consists of several regular patterns connected by the disjunction operator. The regular patterns follow a specific format. Every pattern:

  • Has exactly two TFs and one relation word (possibly an inflexion).
  • Has connector words (optional).
  • Has intra-term distance limits.

Figure 3. Conceptual structure of DTFRE.

For example, consider the first regular pattern in Figure 4. It has two TF names and an inflexion of the relation word “interact”. The connector word is “with”, which appears between the relation word and the second TF name. The distances between the adjacent terms stand for the maximum number of wildcard words that could be tolerated between the respective terms, for a rule match.

Figure 4. Sample rules for “interact” relation.
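
A pattern of the kind described above (two TFs, one relation word, optional connector words, and distance limits between adjacent terms) can be matched against a tokenized sentence as in the sketch below. The `Pattern` class, the term encoding, and the prefix test used for inflexions are assumptions made for illustration; the actual rule syntax of the system is the one shown in Figure 4.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    terms: list     # e.g. [("TF", None), ("REL", "interact"), ("WORD", "with"), ("TF", None)]
    max_gaps: list  # max wildcard words allowed between adjacent terms


def match(pattern, tokens, tf_names):
    """Return (tf1, relation word, tf2) for the first match, or None."""

    def fits(term, token):
        kind, value = term
        if kind == "TF":
            return token in tf_names
        if kind == "REL":
            # Crude handling of inflexions: accept any token starting with the stem.
            return token.lower().startswith(value)
        return token.lower() == value  # connector word

    def extend(term_idx, prev_pos, picked):
        if term_idx == len(pattern.terms):
            return picked
        if term_idx == 0:
            candidates = range(len(tokens))
        else:
            gap = pattern.max_gaps[term_idx - 1]
            candidates = range(prev_pos + 1, min(len(tokens), prev_pos + 2 + gap))
        for pos in candidates:
            if fits(pattern.terms[term_idx], tokens[pos]):
                result = extend(term_idx + 1, pos, picked + [tokens[pos]])
                if result is not None:
                    return result
        return None

    picked = extend(0, -1, [])
    if picked is None:
        return None
    kinds = [kind for kind, _ in pattern.terms]
    tfs = [tok for tok, kind in zip(picked, kinds) if kind == "TF"]
    rel = next(tok for tok, kind in zip(picked, kinds) if kind == "REL")
    return (tfs[0], rel, tfs[1])


interact_with = Pattern(
    terms=[("TF", None), ("REL", "interact"), ("WORD", "with"), ("TF", None)],
    max_gaps=[2, 0, 2],
)
tokens = "c-Jun directly interacts with the NF-kappaB".split()
print(match(interact_with, tokens, {"c-Jun", "NF-kappaB"}))
```

A disjunctive rule is then simply a list of such patterns, and a sentence matches the rule if it matches any one of them.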

Learning Algorithm

Figure 5 presents an outline of our algorithm for learning the disjunctive rules from a training sentence corpus. The algorithm picks a random positive example and attempts to identify a candidate rule pattern that has support and confidence above minimum specified values. All positive examples covered by this pattern are removed, and more candidate rules are generated until all positive examples are covered. The candidate list is then pruned to remove insignificant rules, and the remaining rules represent the learned rules. The algorithm is run for each relation separately, and hence there is at least one rule for every relation. The rulebase is simply a collection of all the learned rules.

Figure 5. Structure of the rule-learning algorithm.
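
The covering loop outlined above can be sketched in a strongly simplified form. In this sketch a "pattern" is just a single word occurring between two TF names, which is far cruder than the patterns used by the system; the candidate generation, the tie-breaking, the handling of seeds for which no acceptable pattern exists, and the redundancy-based pruning are all assumptions introduced for illustration.

```python
import random


def covers(word, example):
    # An example is the tuple of words found between two TF names.
    return word in example


def learn_rules(positives, negatives, min_support=2, min_confidence=0.8, seed=0):
    rng = random.Random(seed)
    uncovered = list(positives)
    candidates = []
    while uncovered:
        seed_example = rng.choice(uncovered)
        best = None
        for word in seed_example:
            pos_hits = sum(covers(word, p) for p in positives)
            neg_hits = sum(covers(word, n) for n in negatives)
            confidence = pos_hits / (pos_hits + neg_hits)
            if pos_hits >= min_support and confidence >= min_confidence:
                if best is None or (confidence, pos_hits) > best[:2]:
                    best = (confidence, pos_hits, word)
        if best is None:
            # No acceptable pattern for this seed: drop it so the loop terminates.
            uncovered.remove(seed_example)
            continue
        word = best[2]
        candidates.append(word)
        uncovered = [p for p in uncovered if not covers(word, p)]

    # Pruning (assumed criterion): drop a rule if the other rules already
    # cover every positive example that it covers.
    rules = list(candidates)
    for word in candidates:
        others = [w for w in rules if w != word]
        own = [p for p in positives if covers(word, p)]
        if others and all(any(covers(w, p) for w in others) for p in own):
            rules.remove(word)
    return rules


positives = [
    ("interacts", "with"),
    ("directly", "interacts", "with"),
    ("binds", "to"),
    ("binds", "strongly", "to"),
]
negatives = [("and",), ("was", "compared", "to"), ("with",)]
print(sorted(learn_rules(positives, negatives)))
```

Because the learned rules are plain, readable patterns, a curator can inspect and adjust them by hand after learning.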

The learned rules were evaluated using 3-fold cross validation. We obtained over 90% precision and 75% recall on average for the seven types of relation words (‘interact’, ‘complex’, ‘bind’, ‘associate’, ‘synergise’, ‘cooperate’, ‘inhibit’) we used in this system. We spent about 300 hours manually tuning the learned rules and obtained a significant increase in accuracy, reflected in 93% precision and 88% recall. For comparison, SUISEKI's14 reported performance in extraction of protein-protein interactions is 46% precision and 40% recall, while PreBIND16 performed at 92% precision and 92% recall, but only for the restricted problem of classifying sentences as describing protein-protein interactions or not. PreBIND does not extract the actual relations.
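
For readers unfamiliar with these measures, the sketch below shows how precision and recall are computed from a set of extracted relations and a manually curated gold standard, and how items can be divided into folds for cross validation. The relation triples are invented placeholders, and the interleaved fold split is only one of several possible ways to form folds.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def k_folds(items, k=3):
    # Interleaved split: item i goes to fold i mod k.
    return [items[i::k] for i in range(k)]


# Invented placeholder relations, not taken from any real corpus.
gold = {("A", "interact", "B"), ("A", "bind", "C"), ("B", "inhibit", "D"), ("C", "complex", "D")}
extracted = {("A", "interact", "B"), ("A", "bind", "C"), ("B", "bind", "C")}
print(precision_recall(extracted, gold))
print(k_folds(list(range(7)), k=3))
```

In k-fold cross validation, each fold is held out once for testing while the rules are learned on the remaining folds, and the per-fold precision and recall are averaged.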

DTFRE is the first public system for TF relation extraction. It achieves 93% precision and 88% recall on our test data, which is very similar to the performance of single-pass manual curation. However, it would be dangerous to extrapolate this performance to an arbitrary set of documents, since the volume of the data used in training and testing is still very small (although it is the largest corpus of manually curated data used for similar tasks in biomedical text-mining). The system is based on a combination of automatic learning for the generation of extraction rules and manual rule tuning. The learning method uses a representation that is human comprehensible, and hence the learned rules are easy to verify and tune manually to achieve the best performance. With the rule learning algorithm, we were able to cut down the hand-crafting time considerably. However, the rule representation is shallow and cannot accurately recognize relations expressed in complex sentence structures, e.g., through coreferences. We are addressing this issue as part of our current work.

Mining Associations of Transcription Factors by Dragon TF Association Miner

While DTFRE aims at identifying and actually extracting specific relations and the TFs subject to such relations, the goal of Dragon TF Association Miner (DTFAM) is different. It aims at providing broader information about potential associations of TFs with concepts from Gene Ontology (GO),29 as well as with diseases, in order to help biologists and medical researchers to infer unusual functional associations. The system uses five well-controlled vocabularies. Three vocabularies are related to GO (biological process, molecular function, and cellular component), while the fourth is related to different disease states. The fifth vocabulary contains TF names. Functional associations of TFs with terms from the four categories (GO and diseases) can be restricted to any combination of these categories, such as biological process alone, or biological process and diseases, depending on the user's selection. All GO vocabularies are general. The disease vocabulary focuses on human diseases, while the TF vocabulary contains over 10,000 TF names and their synonyms collected for various species, mainly eukaryotes, but also including E. coli, B. subtilis and some other prokaryotes. Some necessary data cleaning has been done on all vocabularies in order to enable more efficient text-mining.

DTFAM can be accessed freely by academic and nonprofit users. The system was trained and tested on the previously described corpus of 3000 manually classified PubMed documents. The system attempts to assess at the ‘Abstract level’ whether or not the analyzed document contains information about TF relations. The user can select the level of filtering of irrelevant documents. This function considerably reduces the ‘noise’ (i.e., the use of irrelevant documents) in the generation of final reports. However, it cannot eliminate the irrelevant documents completely.

There are several modules which operate within the system (see Figure 6):

  • The first module analyses the submitted text, makes necessary indexing of terms and generates features for the intelligent module.
  • The second module analyses the content of the processed document and applies one of 65 previously derived models in assessing whether the analyzed document should be retained or rejected. If the model signals that the document contains information about TF relations, the document is accepted for the final analysis; otherwise it is rejected. The selection of the model is automatic, and it is determined by the sensitivity selected on the system's main page. The higher the sensitivity, the more documents will be selected for the analysis, but this may also include a large number of irrelevant documents. These 65 models were developed and tuned based on specific feature selection, signal processing, nonlinear modeling, artificial neural networks, and discriminant analyses.
  • The third module generates interactive tabular reports.
  • The fourth module analyses the connections (associations) between the terms and generates interactive association map networks of these terms. The association of terms is based on their co-occurrence in the same PubMed document. The nodes of the generated graphs represent the terms from the selected vocabularies. Different shapes and colors are used to make it easy for users to analyze these graphs. All nodes are interactive, and clicking on a node opens a set of related PubMed documents with color-marked terms for the user's inspection and assessment of the relevance of the proposed associations.
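
The co-occurrence principle behind the association maps can be sketched as follows. The function below is an illustrative assumption, not the DTFAM code: it takes, for each document, the set of vocabulary terms already found in it, and returns the edges of an association graph together with the documents that support each edge. The document identifiers are placeholders.

```python
from collections import defaultdict
from itertools import combinations


def association_map(doc_terms):
    """doc_terms: {doc_id: set of vocabulary terms found in that document}.

    Returns {(term_a, term_b): [doc_ids]} with one entry per co-occurring pair.
    """
    edges = defaultdict(list)
    for doc_id in sorted(doc_terms):
        # Sorting the terms gives each pair a single, stable orientation.
        for a, b in combinations(sorted(doc_terms[doc_id]), 2):
            edges[(a, b)].append(doc_id)
    return dict(edges)


# Placeholder document identifiers and term sets.
docs = {
    "doc1": {"NF-kappaB", "immune response"},
    "doc2": {"NF-kappaB", "immune response", "c-Jun"},
}
for pair, support in association_map(docs).items():
    print(pair, support)
```

Keeping the supporting document list on every edge is what makes it possible to link a node or edge in the graph back to the abstracts that the user should inspect.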

Figure 6. Schematic presentation of DTFAM structural modules.

The main characteristics of the DTFAM system are:

  1. It is focused on exploring potential association of TFs with other important functional categories such as GO terms and diseases.
  2. It provides suitable interactive reports, both tabular and graphical.
  3. Its module for filtering irrelevant documents has been trained on a unique, large, manually curated corpus of data.
  4. It uses five manually curated vocabularies (one for TF names and synonyms, three for GO categories, and one for diseases).

This system is unique in the combination of features and utility it provides to users. Its most distinctive features are its focus on transcriptional regulation, its module for filtering out irrelevant documents, trained on a large, manually curated corpus of data on relations between TFs, and the ability of the user to select the stringency of this filtering.

To illustrate how users can extract useful information with this system in a simple way, we will assume that it is of interest to find out which TFs are potentially involved in the Toll-like receptor-mediated activation of the signaling pathway which induces an antimicrobial innate immune reaction.30 We also want to find out which biological processes, molecular functions, cellular components, and diseases could be associated with the TFs found. Antimicrobial peptides are constitutive ingredients of innate immunity, and they act as the first layer of defense of the host against invading pathogens. Some of these peptides are gene products and can be transcriptionally activated. For example, in Drosophila, the Toll signaling pathway regulates rapid production of antimicrobial peptides in response to infection by pathogens. We will perform this exploration by selecting the query ‘toll antimicrobial’. We will also select a sensitivity of 0.95 and all four vocabularies on the main page. As part of the analysis and reports, the system will generate two association map networks. Analysis of the first network, depicted in Figure 7, reveals that DTFAM detected inhibitor kappaB (IkappaB), NF-kappaB, and c-Jun as TFs relevant for this signaling pathway. The roles of these three TFs in this pathway are documented.30-32 All other entities found and presented in the network relate to proper GO categories, immune response, and diseases. We also observe that the Drosophila TFs Cactus and Dorsal have been found. Cactus is an IkappaB-like TF, while Dorsal is an NF-kappaB-like TF. This shows that DTFAM is capable of extracting relevant biological knowledge. We suggest, however, that a user should not blindly accept the results of the analysis and should evaluate the relevance of detected associations by consulting the references used by the system. Since the system provides interactive graphs with links to the documents used, as well as the color-highlighted terms used in the analysis, this task is made easier for the user.

Figure 7. The network generated for the task described above. TF names are represented by ellipsoidal nodes with a yellow background. Diseases are represented by ellipsoidal nodes with a gray background. Terms from GO categories are represented by rhomboidal shapes.

Exploring the Metabolome of Arabidopsis thaliana and Other Plant Species by Dragon Metabolome Explorer

The largest category of gene functions in all the eukaryotic genomes sequenced thus far is that of metabolism, which can comprise almost 25% of all genes.33 The metabolome, in its complete sense, includes all metabolic pathways and their components, including the enzymes and the regulators. In this section we present a system, Dragon Metabolome Explorer (DME), for the exploration of metabolic subsystems in plants and their associations with genes and all the GO categories summarized in ontologies adopted by the Arabidopsis research community. In addition to general GO categories, such as biological processes, molecular functions, and cellular components, DME uses Arabidopsis-specific ontologies related to anatomical parts and developmental stages. The exploratory analysis of associations of the GO terms/entities can suggest meaningful functional links and pave the way for a more detailed and focused analysis using experimental approaches. The system is free for academic and nonprofit users.

Metabolic processes control body functions through highly complex networked pathways. Many small molecules associated with such pathways act as regulators of genes and diverse cellular functions. Understanding metabolic processes, therefore, is one of the key issues of modern biology and requires a systems approach due to their complex nature. The limits of metabolic complexity are found in plants due to their extensive secondary metabolism networks; therefore, they form excellent resources for developing knowledge extraction tools which target metabolic pathways. The model plant, Arabidopsis thaliana, has 185 metabolic pathways documented, including over 700 different compounds and nearly 525 enzymes.34,35 However, the information about pathways is not complete, which explains the fact that there are approximately 900 known metabolites found experimentally in Arabidopsis, which are not assigned to any of the known metabolic pathways.

The currently available public resources on metabolic pathways are almost exclusively devoted to representing the metabolic pathways. Some of these are MetaCyc,36 ENZYME,37 BRENDA,38 IntEnz,39 KEGG,40 PathDB, UM-BBD,41 and WIT2.42 Some recent resources, such as AraCyc, additionally link the pathway information to the genome resources.35 However, to the best of our knowledge, there are currently no resources for extracting knowledge about the function of metabolites and pathways from the existing literature with the aim of complementing the pathway-related information. Our system, DME, is one such bioinformatic tool that can support research in this direction and can simplify the task for individual biologists. It has sufficient flexibility and provides comprehensive, summarized information in a form suitable for simple use by biologists. The information provided also has high coverage, attempting to include much of what is known.

The algorithm is based on text analysis of PubMed documents. The system uses several highly controlled vocabularies, matches co-occurrences of terms from these vocabularies within a set of documents, and determines the significance of each of these terms. It provides users with comprehensive listings of three categories of metabolome components found in the analyzed documents (pathways, enzymes, and metabolites), of any of the categories from the three additional vocabularies specific to Arabidopsis thaliana (related to anatomy, developmental stages, and genes), and of those related to cellular component, biological process, and molecular function. DME attempts to detect potential associations between the terms from these vocabularies and produces different reports, including networks of associations. All reports, including graphical ones, are interactive and contain hyperlinked nodes that lead directly to PubMed abstracts.

There are three possible ways a user can submit documents for the analysis. Documents can be selected by forming any query acceptable to the Entrez search engine of the PubMed repository, or the user can perform PubMed searches in advance, save the selected documents in text format, and then submit the saved documents to the program for analysis. The second mode is preferable. The system possesses great flexibility, as it allows an arbitrary query to be submitted for abstract selection. The tabular report presents every term from the selected vocabularies found in the document set, with links to the PubMed documents where the terms have been found. These terms are also shown in three colors (green, red, and blue for pathways, metabolites, and enzymes, respectively) for easier visual inspection. The graphical report may contain several association networks that depict the terms found in the analysis. Each term is represented by an oval node (green nodes denoting pathways, yellow nodes with red letters denoting metabolites, and blue nodes denoting enzymes). Again, the different colors make the generated association networks easier to inspect.
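
The structure of such a tabular report can be sketched as a vocabulary lookup over a set of abstracts. The sketch below is only an illustration: the vocabulary entries are a handful of terms mentioned in this chapter, the abstract texts and document identifiers are invented placeholders, and case-insensitive substring search stands in for whatever term matching and significance scoring the system actually applies.

```python
# Tiny illustrative vocabularies; the real ones are large and curated.
VOCABULARIES = {
    "pathway": {"anthocyanin biosynthesis", "flavonoid biosynthesis"},
    "metabolite": {"dihydrokaempferol", "leucopelargonidin"},
    "enzyme": {"dihydroflavonol 4-reductase"},
}


def tabular_report(abstracts):
    """abstracts: {doc_id: text}. Returns rows of (category, term, [doc_ids])."""
    rows = []
    for category in sorted(VOCABULARIES):
        for term in sorted(VOCABULARIES[category]):
            hits = sorted(
                doc_id
                for doc_id, text in abstracts.items()
                if term.lower() in text.lower()  # crude substring matching
            )
            if hits:
                rows.append((category, term, hits))
    return rows


# Placeholder texts, written only to exercise the lookup.
abstracts = {
    "doc1": "Dihydroflavonol 4-reductase reduces dihydrokaempferol.",
    "doc2": "Text mentioning dihydrokaempferol and anthocyanin biosynthesis.",
}
for row in tabular_report(abstracts):
    print(row)
```

Each row keeps the category alongside the term, which is the information a front end would need in order to color pathways, metabolites, and enzymes differently.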

We illustrate here how this system can be used to infer function related to metabolic pathways. To do so, we take the example of the pathways, metabolites, enzymes, and plant anatomy terms associated with the activity of the metabolite dihydrokaempferol. The query was “dihydrokaempferol”. The selected vocabularies were pathways, metabolites, enzymes, and anatomy. The system produced an interactive tabular report, part of which is shown in Figure 8, and an interactive association map network, part of which is depicted in Figure 9.

Figure 8. Part of the interactive tabular report of DME using the term “dihydrokaempferol”.


Figure 9. A part of the DME association map network generated by the query shown in Figure 8. Pathways, enzymes, metabolites and plant anatomy terms are shown in different shaped and colored nodes.

DME found the anthocyanin biosynthesis pathway, of which dihydrokaempferol is part. It also found 18 compounds, 9 enzymes, and 8 anatomy parts where this metabolite is found. Of the components of the anthocyanin biosynthesis pathway, DME identified 5 of 8 metabolites and 2 of 3 enzymes. DME also displayed 5 of 10 metabolites and 3 of 9 enzymes which are present in the flavonoid biosynthesis pathway. Some of these metabolites and enzymes are shared between the two pathways, suggesting links between the anthocyanin biosynthesis and flavonoid biosynthesis pathways. Both pathways fall under the more general phenylpropanoid pathway. As a further example, DME found that flower is related to dihydroflavonol 4-reductase, flavonoid 3'-hydroxylase, dihydrokaempferol, leucopelargonidin, and anthocyanin biosynthesis. It is indeed documented that the anthocyanin biosynthesis pathway is involved in flower pigmentation.43 Additionally, the above-mentioned enzymes and metabolites from anthocyanin biosynthesis are involved in flower pigmentation (dihydroflavonol 4-reductase does not show pigmentation of flower due to the accumulation of dihydrokaempferol;43 flavonoid 3'-hydroxylase shows pigmentation of anthocyanins;44 leucopelargonidin gives the orange color to flowers). Moreover, dihydroflavonol 4-reductase inefficiently reduces dihydrokaempferol in anthocyanin biosynthesis,45 and DME linked these together. These extracts from the DME reports illustrate that one can infer many specific points related to the function of metabolic subsystems.

Comparative Analysis of Bacterial Species

One of the interesting possibilities is the use of text-mining in cross-species studies. The aim of such tasks is to find, in an automated fashion, the facts common to two or more species, as well as those specific to individual species or groups of species. For example, we may be interested in finding the parts of complex regulatory networks and pathways that are preserved in various species (and are thus common), as well as the gene networks that are characteristic of the development of separate species and related to the same or similar pathways. Because the results of text-mining are putative, this approach is most useful for suggesting functional associations between the entities searched in a given framework.

Our system, Dragon Explorer of Bacterial Genomes (DEBG), currently has data for two bacteria, Pseudomonas aeruginosa and Escherichia coli. We plan to extend it to other microbes shortly. DEBG contains species-specific vocabularies of genes (and their synonyms), each containing several thousand entries. The system analyzes the co-occurrence of terms from these vocabularies in several groups of carefully selected documents from the PubMed repository and summarizes the results obtained. It then provides interactive graphical and tabular presentations of the associations found. DEBG relies on a local installation of PubMed.

Users can supply up to three concepts to be used in the selection of documents. One of the concepts should be broad, while the other two should be more specific. DEBG will automatically form several queries to collect the documents required for the analysis. To illustrate the use of DEBG, let us assume that we are interested in exploring the differences in the gene networks controlling flagellar motility and twitching motility46-48 in our two bacterial species. We can select ‘motility’ as the broad category, while ‘flagella’ and ‘twitching OR fimbriae OR pili’ can be selected as the more specific categories. We added ‘fimbriae’ and ‘pili’ because twitching motility in E. coli is commonly associated with fimbriae, while in P. aeruginosa it is associated with the type IV pili. Examples of the queries which DEBG forms are as follows:

  Q1: (Pseudomonas OR ‘Escherichia coli’) AND flagella
  Q2: (Pseudomonas OR ‘Escherichia coli’) AND (twitching OR fimbriae OR pili)
  Q3: (Pseudomonas OR ‘Escherichia coli’) AND motility
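The way queries such as Q1 to Q3 are assembled from the organism names and the user's concepts can be sketched as follows. This is a minimal illustration, not DEBG's code: the function name `build_queries` and its parameters are hypothetical, and straight double quotes stand in for the quotation marks around the multi-word organism name.

```python
def build_queries(organisms, broad, specific):
    """Combine one organism clause with each concept, specific concepts first."""
    organism_clause = "(" + " OR ".join(organisms) + ")"
    queries = []
    for concept in list(specific) + [broad]:
        # Parenthesize compound concepts so that AND applies to the whole
        # concept rather than to its first alternative only.
        clause = f"({concept})" if " OR " in concept else concept
        queries.append(f"{organism_clause} AND {clause}")
    return queries


queries = build_queries(
    organisms=["Pseudomonas", '"Escherichia coli"'],
    broad="motility",
    specific=["flagella", "twitching OR fimbriae OR pili"],
)
for number, query in enumerate(queries, start=1):
    print(f"Q{number}: {query}")
```

Keeping the organism clause identical across all queries means that the document sets differ only in the concept, which is what allows genes to be assigned to a motility category later on.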

After the documents are collected, the system identifies the terms from the vocabularies that occur in them and indexes every term found by organism (one, the other, or both) and by category (motility, flagella, twitching). Based on these summary results, the system generates interactive tabular and graphical reports.
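The indexing step can be pictured with a short sketch. It rests on assumptions that the text does not confirm: each document is tagged with the category of the query that retrieved it, a gene is assigned to a species when its name appears in that species' vocabulary, and matching is done on whole case-sensitive tokens. The function name `index_genes` and the toy vocabularies are hypothetical.

```python
import re
from collections import defaultdict


def index_genes(documents, gene_vocabularies):
    """Index each gene found by species vocabulary and by query category.

    documents: iterable of (category, text) pairs, where category names
        the query that retrieved the text.
    gene_vocabularies: dict mapping a species name to a set of gene names.
    """
    index = defaultdict(lambda: {"species": set(), "categories": set()})
    for category, text in documents:
        # Gene names are case-sensitive (e.g. fliC), so case is preserved.
        tokens = set(re.findall(r"[A-Za-z0-9]+", text))
        for species, genes in gene_vocabularies.items():
            for gene in genes & tokens:
                index[gene]["species"].add(species)
                index[gene]["categories"].add(category)
    return dict(index)


# Toy vocabularies for illustration only.
vocabularies = {
    "P. aeruginosa": {"fliC", "pilA"},
    "E. coli": {"fliC", "fimA"},
}
documents = [
    ("flagella", "fliC encodes flagellin."),
    ("twitching", "pilA and fimA were studied."),
    ("motility", "fliC mutants are non-motile."),
]
index = index_genes(documents, vocabularies)
```

A gene whose "species" set holds both organisms corresponds to a term indexed as belonging to both, and its "categories" set records which of the three document groups mention it.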

The system provides a colored output for the different groups of genes found in the documents specific to queries Q1 and Q2. In the gene association networks, the nodes represent the genes. Genes specific to one species are shown with one shape of node, while those from the other species are shown with a different shape. Genes found in the documents retrieved by query Q3 are also depicted as separately shaped nodes in a different color. These differences in the color and shape of the nodes, which reflect categories and species, make the inspection and analysis of the resulting networks much easier for biologists.

For example, in Figure 10, one may observe that the genes in the generated network appear in three big groups: one yellow, corresponding to twitching motility; another magenta, corresponding to flagellar motility; and a third green, corresponding to genes contained in documents related to ‘motility’ but not directly related to flagellar or twitching motility. One can easily track the association of genes to species, as well as potential associations of other genes that are supposedly involved in motility but not necessarily associated with the two specific types of movement. An interesting observation is that fliC (Fig. 10, bottom panel), the most abundant structural component of the flagellar apparatus, is linked extensively to other genes in the network, indicating its importance in the formation of the flagella. Also, genes with related functions are likely to be located in close proximity; for example, the che genes involved in chemotaxis are clustered together in the network.
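One simple way to quantify how extensively a gene such as fliC is linked is to count its distinct neighbors (its degree) in the association network. The sketch below assumes the network is available as a list of undirected gene pairs; the edge list is an invented toy example, not data from Figure 10, and the function name `node_degrees` is hypothetical.

```python
from collections import Counter


def node_degrees(edges):
    """Count the distinct neighbors of each gene in an undirected network."""
    # frozenset makes (a, b) and (b, a) the same edge; self-loops are dropped.
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degrees = Counter()
    for edge in unique_edges:
        for gene in edge:
            degrees[gene] += 1
    return degrees


# Hypothetical toy edges for illustration only.
toy_edges = [
    ("fliC", "fliD"),
    ("fliC", "flgE"),
    ("flgE", "fliC"),  # duplicate of the previous edge, counted once
    ("cheA", "cheY"),
    ("fliC", "cheA"),
]
degrees = node_degrees(toy_edges)
```

Genes with the highest degree are the hubs that stand out visually in the network; whether a hub is biologically important still has to be checked against the underlying abstracts.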

Figure 10. A part of a complex network of gene associations based on textual searches and related to ‘motility’, ‘flagellar motility’, and ‘twitching motility’ in P. aeruginosa and E. coli. The flagellar structural ...

The conclusion of this in silico experiment, which included 3522 documents in total, is that, in a relatively simple fashion and in a short time, we can summarize part of the information regarding these two types of movement in the two bacteria and obtain rich material for further detailed analysis.


We show here that text-mining is a useful technology that can support research in the life sciences and allow easier inference of the function of the examined entities. The strength of this approach is its comprehensiveness and its ability to present sometimes unexpected associations of categories and terms based on the analysis of large sets of documents. Such an analysis is not feasible for a single user. However, this comprehensiveness is also a weakness, since very few text-mining systems have built-in intelligence to automatically determine relevance from the document context. The accuracy of such intelligent blocks is currently not sufficiently high, which requires users to carefully analyze the results obtained. Nevertheless, developments in natural language processing will make crucial contributions to this growing field in the future.49


Wheeler DL, Church DM, Edgar R. et al. Database resources of the National Center for Biotechnology Information: Update. Nucleic Acids Res. 2004;32:D35–40. [PMC free article: PMC308807] [PubMed: 14681353]
Dickman S. Tough Mining: The challenges of searching the scientific literature. PLoS Biol. 2003;1(2):E48. [PMC free article: PMC261887] [PubMed: 14624250]
de Bruijn B, Martin J. Getting to the (c)ore of knowledge: Mining biomedical literature. Int J Med Inf. 2002;67(1-3):7–18. [PubMed: 12460628]
Grivell L. Mining the bibliome: Searching for a needle in a haystack? New computing tools are needed to effectively scan the growing amount of scientific literature for useful information. EMBO Rep. 2002;3(3):200–203. [PMC free article: PMC1084023] [PubMed: 11882534]
Andrade MA, Bork P. Automated extraction of information in molecular biology. FEBS Lett. 2000;476(1-2):12–17. [PubMed: 10878241]
Schulze-Kremer S. Ontologies for molecular biology and bioinformatics. In Silico Biol. 2002;2(3):179–193. [PubMed: 12542404]
Jenssen TK, Laegreid A, Komorowski J. et al. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001;28(1):21–28. [PubMed: 11326270]
Tanabe L, Scherf U, Smith LH. et al. An Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques. 1999;27(6):1210–4, 1216–7. [PubMed: 10631500]
Perez-Iratxeta C, Perez AJ, Bork P. et al. Update on XplorMed: A web server for exploring scientific literature. Nucleic Acids Res. 2003;31(13):3866–3868. [PMC free article: PMC168945] [PubMed: 12824439]
Becker KG, Hosack DA, Dennis Jr G. et al. PubMatrix: A tool for multiplex literature mining. BMC Bioinformatics. 2003;4(1):61. [PMC free article: PMC317283] [PubMed: 14667255]
Asher B. Decision analytics software solutions for proteomics analysis. J Mol Graph Model. 2000;18:79–82. [PubMed: 10935212]
Hosack DA, Dennis G, Sherman BT. et al. Identifying biological themes within lists of genes with EASE. Genome Biology. 2003;4:R70. [PMC free article: PMC328459] [PubMed: 14519205]
Kim SK, Lund J, Kiraly M. et al. A gene expression map for Caenorhabditis elegans. Science. 2001;293:2087–2092. [PubMed: 11557892]
Blaschke C, Valencia A. The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform Ser Workshop Genome Inform. 2001;12:123–34. [PubMed: 11791231]
Chiang JH, Yu HC, Hsu HJ. GIS: A biomedical text-mining system for gene information discovery. Bioinformatics. 2004;20(1):120–121. [PubMed: 14693818]
Donaldson I, Martin J, de Bruijn B. et al. PreBIND and Textomy—mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics. 2003;4(1):11. [PMC free article: PMC153503] [PubMed: 12689350]
Perez-Iratxeta C, Bork P, Andrade MA. Association of genes to genetically inherited diseases using data mining. Nature Genetics. 2002;31:316–319. [PubMed: 12006977]
Chiang JH, Yu HC. MeKE: Discovering the functions of gene products from biomedical literature via sentence alignment. Bioinformatics. 2003;19(11):1417–1422. [PubMed: 12874055]
Srinivasan P. MeSHmap: A text mining tool for MEDLINE. Proc AMIA Symp. 2001:642–646. [PMC free article: PMC2243391] [PubMed: 11825264]
Lee TI, Rinaldi NJ, Robert F. et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002;298:799–804. [PubMed: 12399584]
Werner T, Fessele S, Maier H. et al. Computer modeling of promoter organization as a tool to study transcriptional coregulation. FASEB J. 2003;17(10):1228–37. [PubMed: 12832287]
Kel-Margoulis OV, Kel AE, Reuter I. et al. A database on composite regulatory elements in eukaryotic genes. Nucleic Acids Res. 2002;30(1):332–4. [PMC free article: PMC99108] [PubMed: 11752329]
Thomas J, Milward D, Ouzounis C. et al. Automatic extraction of protein interactions from scientific abstracts. Pacific Symposium on Biocomputing. 2000;5:538–549. [PubMed: 10902201]
Blaschke C, Valencia A. The frame-based module of the Suiseki information extraction system. IEEE Intelligent Systems. 2002;17:14–20.
Ono T, Hishigaki H, Tanigami A. et al. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics. 2001;17(2):155–161. [PubMed: 11238071]
Appelt DE, Israel D. Introduction to information extraction technology. Proc of International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden. 1999. (URL http://www.ai.sri.com/~appelt/ie-tutorial/)
Muslea I. Extracting patterns for information extraction tasks: A survey. The AAAI Workshop on Machine Learning for Information Extraction. 1999. (URL http://www.ai.sri.com/~muslea/papers.html)
Bunescu R, Ge RF, Kate RJ. et al. Learning to extract proteins and their interactions from medline abstracts. Proceedings of the ICML-2003 Workshop on Machine Learning in Bioinformatics. 2003:46–53.
Harris MA, Clark J, Ireland A. et al. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32:D258–61. [PMC free article: PMC308770] [PubMed: 14681407]
Telepnev M, Golovliov I, Grundstrom T. et al. Francisella tularensis inhibits Toll-like receptor-mediated activation of intracellular signaling and secretion of TNF-alpha and IL-1 from murine macrophages. Cell Microbiol. 2003;5(1):41–51. [PubMed: 12542469]
Takeuchi O, Akira S. Toll-like receptors; their physiological role and signal transduction system. Int Immunopharmacol. 2001;1(4):625–35. [PubMed: 11357875]
Lee SJ, Lee S. Toll-like receptors and inflammation in the CNS. Curr Drug Targets Inflamm Allergy. 2002;1(2):181–91. [PubMed: 14561199]
The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796. [PubMed: 11130711]
Mueller. AraCyc: A biochemical pathway database for Arabidopsis. Plant Physiol. 2003;132:453–460. [PMC free article: PMC166988] [PubMed: 12805578]
Rhee SY. et al. The Arabidopsis Information Resource (TAIR): A model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003;31:224–228. [PMC free article: PMC165523] [PubMed: 12519987]
Krieger CJ, Zhang P, Mueller LA. et al. MetaCyc: A multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res. 2004;32:D438–442. [PMC free article: PMC308834] [PubMed: 14681452]
Bairoch A. The ENZYME database in 2000. Nucleic Acids Res. 2000;28:304–305. [PMC free article: PMC102465] [PubMed: 10592255]
Pharkya P, Nikolaev EV, Maranas CD. Review of the BRENDA database. Metab Eng. 2003;5(2):71–3. [PubMed: 12850129]
Fleischmann A, Darsow M, Degtyarenko K. et al. IntEnz, the integrated relational enzyme database. Nucleic Acids Res. 2004;32:D434–7. [PMC free article: PMC308853] [PubMed: 14681451]
Kanehisa M, Goto S, Kawashima S. et al. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004;32:D277–80. [PMC free article: PMC308797] [PubMed: 14681412]
Ellis LB, Hershberger CD, Bryan EM. et al. The University of Minnesota biocatalysis/biodegradation database: Emphasizing enzymes. Nucleic Acids Res. 2001;29(1):340–3. [PMC free article: PMC29774] [PubMed: 11125131]
D'Souza M, Romine MF, Maltsev N. SENTRA, a database of signal transduction proteins. Nucleic Acids Res. 2000;28(1):335–6. [PMC free article: PMC102390] [PubMed: 10592266]
Johnson ET, Yi H, Shin B. et al. Cymbidium hybrida dihydroflavonol 4-reductase does not efficiently reduce dihydrokaempferol to produce orange pelargonidin-type anthocyanins. Plant J. 1999;19(1):81–5. [PubMed: 10417729]
Owens DK, Hale T, Wilson LJ. et al. Quantification of the production of dihydrokaempferol by flavanone 3-hydroxytransferase using capillary electrophoresis. Phytochem Anal. 2002;13(2):69–74. [PubMed: 12018025]
Prescott AG, Stamford NP, Wheeler G. et al. In vitro properties of a recombinant flavonol synthase from Arabidopsis thaliana. Phytochemistry. 2002;60(6):589–93. [PubMed: 12126705]
Macnab RM. How bacteria assemble flagella. Annu Rev Microbiol. 2003;57:77–100. [PubMed: 12730325]
Wall D, Kaiser D. Type IV pili and cell motility. Mol Microbiol. 1999;32:1–10. [PubMed: 10216854]
Bardy SL, Ng SYM, Jarrell KF. Prokaryotic motility structures. Microbiology. 2003;149:295–304. [PubMed: 12624192]
Manning CD, Schütze H. Foundations of statistical natural language processing. MIT Press. 1999.
Copyright © 2000-2013, Landes Bioscience.
Bookshelf ID: NBK6188