Copyright © 2008 Poulter et al; licensee BioMed Central Ltd.

MScanner: a classifier for retrieving Medline citations

1 UCT NBN Node, Department of Molecular and Cell Biology, University of Cape Town, Cape Town, South Africa
2 Stanford Medical Informatics, Stanford University, San Francisco, USA
3 Department of Bioengineering and Department of Genetics, Stanford University, San Francisco, USA

Corresponding author. Graham L Poulter: graham.poulter/at/gmail.com; Daniel L Rubin: dlrubin/at/stanford.edu; Russ B Altman: russ.altman/at/stanford.edu; Cathal Seoighe: cathal.seoighe/at/uct.ac.za

Received September 7, 2007; Accepted February 19, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains.

Results

MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92.

Conclusion

MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu.

Background

Ad-hoc information retrieval

Information retrieval on the biomedical literature indexed by Medline [1] is most often carried out using ad-hoc retrieval. The PubMed [2] boolean search engine is the most widely used Medline retrieval system.
Other interfaces to searching Medline include relevance ranking systems such as Relemed [3] and systems such as EBIMed [4] that perform information extraction and clustering on results. Certain web search engines such as Google Scholar [5] also index much of the same literature as Medline. Alternatives to ordinary queries include the related articles feature of PubMed [6], which returns the Medline records most similar to a given record of interest, and the eTBlast [7] search engine which ranks Medline abstracts by their similarity to a given paragraph of text.

Supervised learning for database curation

Ad-hoc retrieval in general has proven inefficient for the task of identifying articles relevant to databases that require manual curation of entries from biomedical literature, such as the Pharmacogenetics Knowledgebase (PharmGKB) [8], and for constructing corpora for automated text mining systems such as Textpresso [9,10]. It is difficult to design an expert boolean query (the knowledge engineering approach to document classification [11]) that recalls most of the relevant documents without retrieving many irrelevant documents at the same time, when there are many document features that potentially indicate relevance. The case of many relevant features is, however, effectively handled using supervised learning, in which a text classifier is inductively trained from labelled examples [12,13].

Several databases have therefore used supervised learning to filter Medline for relevant documents [14], a recent example being the Immune Epitope Database (IEDB) [15]. IEDB researchers first used a sensitive PubMed query several pages in length to obtain a Medline subset of 20,910 records. The components of the query had previously been used by IEDB curators, whose manual relevance judgements formed a "gold standard" training corpus of 5,712 relevant and 15,198 irrelevant documents.
Different classifier algorithms and document representations were evaluated under cross validation, and their performance compared using the area under the Receiver Operating Characteristic (ROC) curve [16]. The best of the trained classifiers is to be applied to future results of the sensitive query to reduce the number of documents that have to be manually reviewed.

Supervised learning has also been used to identify Medline records relevant to the Biomolecular Interaction Network Database [17], the ACP Journal Club for evidence based medicine [18], the Textpresso resource [9], and the Database of Interacting Proteins (DIP) [19]. Classification may also be performed on full-text articles as in the TREC 2005 Genomics Track [20], and Cohen [21] provides a general-purpose classifier for the task.

Most classifiers have been developed for filtering sets of a few thousand Medline records, but it is possible to classify larger subsets of Medline and even the whole Medline database. A small number of methods have been developed for larger data sets, including an ad-hoc scoring method that has been tested on a stem cell subset of Medline [22], the PharmGKB curation filter [23], and the PubFinder [24] web application derived from the DIP curation filter [19]. However, tasks submitted to the PubFinder site in mid-2006 are still processing and the maintainers are unreachable. In some cases, text mining for relationships between named entities is used instead of supervised learning to judge relevance, for example in the more recent curation filter developed for the DIP [25]. The most closely related articles [6] to individual articles in a collection have also been used to update a bibliography [26] or a database [27].

Comparison of information retrieval approaches

Approaches to retrieving relevant Medline records for database curation have included ad-hoc retrieval (boolean retrieval in particular), related article search, and supervised learning.
Pure boolean retrieval systems like PubMed return (without ranking) all documents that satisfy the logical conditions specified in the query. The vector space models used by web search engines rank documents by similarity to the query, and probabilistic retrieval models rank documents by decreasing probability of relevance to the topics in the query [28].

Related article search retrieves documents by their similarity to a query document, which can be accomplished by using the document as a query string in a ranking ad-hoc retrieval system tuned for long queries [6,29]. Overlap in citation lists has also been used as a benchmark for relatedness [29]. The method used in PubMed related articles [6] directly evaluates the similarity between a pair of documents over all topics (corresponding to vocabulary terms) using a probabilistic model.

Supervised learning trains a document classifier from labelled examples, framing the problem of Medline retrieval as a problem of classifying documents into the categories of "relevant" and "irrelevant". Classifiers may either produce ranked outputs or make hard judgements like a boolean query [12]. Statistical classifiers, such as the Naïve Bayes classifier used here, use the same Probability Ranking Principle as probabilistic ad-hoc retrieval systems [28]. Ranked classifier results may loosely be considered to contain documents closely related to the relevant examples as a whole.

Overview of MScanner

We have developed MScanner, a classifier of Medline records that uses supervised learning to identify relevant records in a non-domain-specific manner. The user provides only relevant citations as training examples, with the rest of Medline approximating the irrelevant examples for training purposes. Most classifiers are developed for particular databases, a limitation that we address by demonstrating effectiveness in multiple domains and providing facilities to evaluate the classifier on new inputs.
We make it easier to use text classification by providing a web interface and operating on all of Medline instead of a Medline subset. To attain the high speeds necessary for online use, we used an optimised implementation of a Naïve Bayes classifier, and a compact document representation derived from two feature spaces in the Medline record metadata, namely the Medical Subject Headings (MeSH) and the journal of publication (ISSN). The choice of the MeSH feature space is informed by a previous study [23], in which classification using MeSH features performed well on PharmGKB citations. We describe the use of the classifier, present example cross validation results, and evaluate the classifier on a gold standard data set derived from an expert PubMed query.

Results

Web interface workflow

The web interface, shown in Figure 1,
The results pages, an example of which is shown in Figure 2,
The submission form allows some of the classifier parameters to be adjusted. These include setting an upper limit on the number of results, or restricting Medline to records completed after a particular date (useful when monitoring for new results). More specialised options include the estimated fraction of relevant articles in Medline (prevalence), and the minimum score to classify an article as relevant. Higher estimated prevalence produces more results by raising the prior probability of relevance (see Methods), while higher prediction thresholds return fewer results, for greater overall precision at the cost of recall.

Cross validation protocol

The web interface provides a 10-fold cross validation function. The input examples are used as the relevant corpus, and up to 100,000 PubMed IDs are selected at random from the remainder of Medline to approximate an irrelevant corpus. In each round of cross validation, 90% of the data is used to estimate term frequencies, and the trained classifier is used to calculate article scores for the remaining 10%. Graphs derived from the cross validated scores include article score distributions, the ROC curve [16] and the curve of precision as a function of recall. Metrics include area under ROC and average precision [30].

Below, we applied cross validation to training examples from three topics (detailed in Methods) and one control corpus, to illustrate different use cases. The PG07 corpus consists of 1,663 pharmacogenetics articles, for the use case of curating a domain-specific database. The AIDSBio corpus consists of 10,727 articles about AIDS and bioethics, for the case of approximating a complex query or extending a text mining corpus. The Radiology corpus consists of 67 articles focusing on splenic imaging, for the case of extending a personal bibliography.
The Control corpus consists of 10,000 randomly selected citations, and exists to demonstrate worst-case performance when the input has the same term distribution as Medline. We derived the irrelevant corpus for each topic from a single corpus, Medline100K, of 100,000 random Medline records. For each topic, we created the irrelevant corpus by taking Medline100K and removing any records that also appear among the relevant training examples. This differs from the web interface, which generates an independent irrelevant corpus every time it is used. A summary of the cross validation statistics for the sample topics is presented in Table 2.
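
The cross validation protocol described above can be sketched in a few lines of Python. This is a minimal illustration rather than the MScanner implementation: the function names, the `train_fn` callback, and the fold assignment by shuffling are assumptions made for the example, and the average precision shown is the common non-interpolated form (mean precision at the rank of each relevant article), which may differ in detail from the metric of [30].

```python
import random

def k_fold_scores(relevant, irrelevant, train_fn, k=10, seed=0):
    """Score every article with a classifier trained on the other k-1 folds.

    train_fn(train_relevant, train_irrelevant) must return a function
    mapping an article to a numeric score (hypothetical interface).
    Returns (relevant_scores, irrelevant_scores).
    """
    rng = random.Random(seed)

    def make_folds(items):
        items = list(items)
        rng.shuffle(items)
        # Fold i takes every k-th item starting at offset i.
        return [items[i::k] for i in range(k)]

    rel_folds, irr_folds = make_folds(relevant), make_folds(irrelevant)
    rel_scores, irr_scores = [], []
    for i in range(k):
        # Train on all folds except fold i (about 90% of the data for k=10).
        train_rel = [a for j, f in enumerate(rel_folds) if j != i for a in f]
        train_irr = [a for j, f in enumerate(irr_folds) if j != i for a in f]
        score = train_fn(train_rel, train_irr)
        # Score the held-out fold.
        rel_scores.extend(score(a) for a in rel_folds[i])
        irr_scores.extend(score(a) for a in irr_folds[i])
    return rel_scores, irr_scores

def average_precision(rel_scores, irr_scores):
    """Mean of the precision measured at the rank of each relevant article."""
    ranked = sorted([(s, 1) for s in rel_scores] + [(s, 0) for s in irr_scores],
                    key=lambda pair: -pair[0])  # stable: ties keep relevant first
    hits, total = 0, 0.0
    for rank, (_, is_relevant) in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / len(rel_scores) if rel_scores else 0.0
```

In this sketch the irrelevant corpus would be a random Medline sample with any overlap with the relevant examples removed before calling `k_fold_scores`, mirroring how Medline100K was used.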

Distributions of article scores

The article score distributions for relevant and irrelevant documents for each topic are shown in Figure 3.

Receiver Operating Characteristic

The ROC curve [16] for each topic is shown in Figure 4.
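
For readers unfamiliar with the construction, an ROC curve and its area can be computed from cross validated scores roughly as follows. This is a generic sketch, not code from MScanner; the function names are hypothetical, and both score lists are assumed to be non-empty.

```python
def roc_curve(rel_scores, irr_scores):
    """Return (fpr, tpr) lists traced by lowering the score threshold.

    Articles with equal scores are consumed together, so ties produce a
    single diagonal segment rather than an arbitrary staircase.
    """
    labelled = sorted([(s, 1) for s in rel_scores] + [(s, 0) for s in irr_scores],
                      key=lambda pair: -pair[0])
    n_rel, n_irr = len(rel_scores), len(irr_scores)
    fpr, tpr = [0.0], [0.0]
    true_pos = false_pos = 0
    i = 0
    while i < len(labelled):
        threshold = labelled[i][0]
        while i < len(labelled) and labelled[i][0] == threshold:
            true_pos += labelled[i][1]
            false_pos += 1 - labelled[i][1]
            i += 1
        tpr.append(true_pos / n_rel)   # recall at this threshold
        fpr.append(false_pos / n_irr)  # fraction of irrelevant articles accepted
    return fpr, tpr

def roc_area(fpr, tpr):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))
```

The area equals the probability that a randomly chosen relevant article scores higher than a randomly chosen irrelevant one (counting ties as one half), which is why a corpus with the same term distribution as Medline is expected to give an area near 0.5.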

Precision under cross validation

We evaluated cross validation precision at different levels of recall in Figure 5,

Performance in a retrieval situation

To evaluate classification performance in a retrieval situation we compared the performance of MScanner to the performance of an expert PubMed query that was used to identify articles for the Immune Epitope Database (IEDB). We made use of the 20,910 results of a sensitive expert query that had been manually split into 5,712 relevant and 15,198 irrelevant articles for the purpose of training the IEDB classifier [15]. MeSH terms were available for 20,812 of the articles, of which 5,680 were relevant and 15,132 irrelevant. The final data set is provided in Additional File 2.

To create training and testing corpora, we first restricted Medline to the 783,028 records completed in 2004, a year within the date ranges of all components of the IEDB query. For relevant training examples we used the 3,488 relevant IEDB results from before 2004, and we approximated irrelevant training examples using the whole of 2004 Medline. We then used the trained classifier to rank the articles in 2004 Medline. We compared precision and recall as a function of rank for MScanner and the IEDB boolean query in Figure 6,

Performance/speed trade-off

We also compared MScanner to the IEDB classifier on its cross validation data, to evaluate the trade-off between performance and speed. The IEDB uses a Naïve Bayes classifier with word features derived from a concatenation of abstract, authors, title, journal and MeSH, followed by an information gain feature selection step and extraction of domain-specific features (peptides and MHC alleles). Using cross-validation to calculate scores for the collection of 20,910 documents, the IEDB classifier obtained an area under ROC curve of 0.855, with a classification speed (after training) of 1,000 articles per 30 seconds. MScanner, using whole MeSH terms and ISSN features, obtained an area under ROC of 0.782 ± 0.003, with a classification speed of approximately 15 million articles per 30 seconds. However, the prior we used for frequency of term occurrence (see Methods) is designed for training data where the prevalence of relevant examples is low. The prevalence of 0.27 in the IEDB data is much higher than the prevalences in Table 2, and using the Laplace prior here would improve the ROC area to 0.825 ± 0.003 but degrade performance in cross validation against Medline100K. The remaining difference in ROC between MScanner and the IEDB classifier reflects information from the abstract and domain-specific features not captured by the MeSH feature space.

All ROC AUC values on the IEDB data are much lower than in the sample cross validation topics. This is because it is more difficult to distinguish between relevant and irrelevant articles among the closely related articles resulting from an expert query, than to distinguish relevant articles from the rest of Medline.

Discussion

Uses of supervised learning for Medline retrieval

Supervised learning has already been applied to the problem of database curation and the development of text mining resources.
However, using a web service like MScanner to perform supervised learning is a simple operation compared to constructing a boolean filter, gold standard training set, and custom-built classifier. MScanner may supplement existing workflows that use a pre-filter query by detecting relevant articles inadvertently excluded by the filter. Another possibility is using MScanner in place of a filter query when one is unavailable, and confirming relevance by passing on the results to a stronger classifier or an information extraction method such as that used by the Database of Interacting Proteins [25].

Supervised learning can also be used in other scenarios where relevant training examples are readily available and the presence of many relevant features hinders ad-hoc retrieval. For example, individual researchers could leverage the documents in a personal bibliography to identify additional articles relevant to their research interests.

Performance evaluation

MScanner's performance varies by topic, depending on the degree to which features are enriched or depleted in relevant articles compared to Medline. The relative performance on different corpora also depends on the evaluation metric used. For example, ROC performance on PG07 shows lower overall ability to distinguish pharmacogenetics articles from Medline, but the right hand sub-plot of Figure 4

The score distributions for the Control corpus (Figure 3)

Document representations

We represented Medline records as binary feature vectors derived from MeSH terms and journal ISSNs. These are separate feature spaces: a MeSH term and ISSN consisting of the same string would not be considered the same feature. Medline provides each MeSH term in a record as a descriptor in association with zero or more qualifiers, as in "Nevirapine/administration & dosage". To reduce the dimensionality of the feature space we treat the descriptor and qualifier as separate features.
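
As an illustration of this representation, the sketch below maps the MeSH headings and ISSN of a record to a sorted array of 16-bit feature IDs. It is a hedged example and not MScanner's actual data structure: the `FeatureIndex` class, the slash-separated heading format, the choice to place qualifiers in the same space as descriptors, and the ISSN used in the usage note are all assumptions made for the example.

```python
from array import array

class FeatureIndex:
    """Assigns small integer IDs to (feature space, name) pairs."""

    MAX_FEATURES = 2 ** 16  # IDs must fit in an unsigned 16-bit integer

    def __init__(self):
        self._ids = {}

    def feature_id(self, space, name):
        # Keying on (space, name) keeps the MeSH and ISSN spaces separate,
        # so identical strings in different spaces get different IDs.
        key = (space, name)
        if key not in self._ids:
            if len(self._ids) >= self.MAX_FEATURES:
                raise OverflowError("feature IDs no longer fit in 16 bits")
            self._ids[key] = len(self._ids)
        return self._ids[key]

    def encode(self, mesh_headings, issn):
        """Return the record's features as a sorted array of 16-bit IDs.

        mesh_headings: strings such as "Nevirapine/administration & dosage",
        i.e. a descriptor followed by zero or more "/"-separated qualifiers
        (assumed input format).
        """
        ids = set()
        for heading in mesh_headings:
            descriptor, *qualifiers = heading.split("/")
            # Descriptor and qualifiers become independent binary features.
            ids.add(self.feature_id("mesh", descriptor.strip()))
            for qualifier in qualifiers:
                ids.add(self.feature_id("mesh", qualifier.strip()))
        if issn:
            ids.add(self.feature_id("issn", issn))
        return array("H", sorted(ids))
```

For example, `FeatureIndex().encode(["Nevirapine/administration & dosage"], "1234-5678")` (with a made-up ISSN) yields three feature IDs: one for the descriptor, one for the qualifier, and one for the journal.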
We detected 24,069 distinct MeSH features in use, and 17,191 ISSN features, for an average of 13.5 features per record. The 2007 MeSH vocabulary comprises 24,357 descriptors and 83 qualifiers. Of the journals, about 5,000 are monitored by PubMed and the rest are represented by only a few records each.

An advantage of the MeSH and ISSN feature spaces is that they allow a compact document representation using 16-bit features, which increases classification speed. MeSH is also a controlled vocabulary, and so does not have word sense ambiguities like free text. However, the vocabulary does not cover all concepts, and covers some areas of biology and medicine (such as medical terminology) more densely than others. Also, not every article has all relevant MeSH terms assigned, and there is a tendency for certain terms to be assigned to articles that just discuss the topic, such as articles "about dental research" rather than dental research articles themselves [34].

Performance can be improved by adding an additional space of binary features derived from the title and abstract of the document. Not relying solely on MeSH features would also enable classification of Medline records that have not been assigned MeSH descriptors yet. The additional features would, however, reduce classification speed due to larger document representations, introduce redundancy with the MeSH feature space, and require a feature selection step. The IEDB classifier [15] avoids redundancy by concatenating the abstract with the MeSH terms and using a single feature space of text words. Binary features should model short abstracts relatively well, although performance on longer texts is known to benefit from considering multiple occurrences of terms [35,36].

MeSH annotations and journal ISSNs are domain-specific resources in the biomedical literature.
The articles cited by a given article (although not provided in Medline) are another domain-specific resource that may prove useful in retrieval tasks, in addition to their uses in navigating the citation network. For example, the overlap in citation lists has been used as a benchmark for article relatedness [29]. In supervised learning, it may be possible to incorporate the number of co-citations between a document and relevant articles, or to use the citing of an article as a binary feature.

Conclusion

MScanner inductively learns topics of interest from example citations, with the aim of retrieving a large number of topical citations more effectively than with boolean queries. It represents an advance on previous tools for Medline classification by performing well across a range of topics and input sizes, by making available implementation source code, and by operating on all of Medline fast enough to use over a web interface. As a non-domain-specific classifier, it has a facility for performing cross validation to obtain ROC and precision statistics on new inputs. MScanner should be useful as a filter for database curation where a sensitive filter query and customised classifier are not already available, and in general for constructing large bibliographies, text mining corpora and other domain-specific Medline subsets.

Methods

Bayesian classification

MScanner uses a Naïve Bayes classifier, which places documents in the class with the greatest posterior probability, and is derived by assuming that feature occurrences are conditionally independent with respect to the class variable. In the multivariate Bernoulli document model [35], each document is represented as a binary vector, f = (f1, f2,...,fk), with 1 or 0 specifying the presence or absence of each feature.
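
A Naïve Bayes log-odds score over such binary vectors can be sketched as follows. The sketch precomputes a base score for an article with no features and adds a per-feature adjustment for each feature that is present, so that scoring touches only the handful of features an article actually has. The add-one (Laplace-style) pseudocount used to estimate feature frequencies is a stand-in assumption, not the estimate used by MScanner, and the function names are hypothetical.

```python
import math

def train_support_scores(rel_counts, n_rel, irr_counts, n_irr,
                         num_features, pseudo=1.0):
    """Return (base_score, adjustments) for a Bernoulli Naive Bayes scorer.

    rel_counts / irr_counts map feature ID -> number of relevant / irrelevant
    training articles containing that feature. The pseudocount smoothing is
    an assumption made for this example.
    """
    base = 0.0
    adjust = [0.0] * num_features
    for i in range(num_features):
        p_rel = (rel_counts.get(i, 0) + pseudo) / (n_rel + 2 * pseudo)
        p_irr = (irr_counts.get(i, 0) + pseudo) / (n_irr + 2 * pseudo)
        present = math.log(p_rel / p_irr)              # support if feature occurs
        absent = math.log((1 - p_rel) / (1 - p_irr))   # support if it does not
        base += absent                  # score of an article with no features
        adjust[i] = present - absent    # correction applied when feature i occurs
    return base, adjust

def article_score(feature_ids, base, adjust, prior_relevance):
    """Log posterior odds of relevance for an article with the given features."""
    prior = math.log(prior_relevance / (1 - prior_relevance))
    return prior + base + sum(adjust[i] for i in feature_ids)
```

With this formulation an article would be predicted relevant when its score is at least zero (or another chosen threshold), and `prior_relevance` could be set to the number of training examples divided by the number of articles in Medline.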
The score of the article is the logarithm of the posterior probability ratio for the article being relevant versus irrelevant, which reduces to a sum of feature support scores and a prior score:

\[ S(f) = \log \frac{P(R \mid f)}{P(\bar{R} \mid f)} = \sum_{i=1}^{k} Y(F_i = f_i) + \log \frac{P(R)}{1 - P(R)} \]

The feature support scores [37] are:

\[ Y(F_i = f_i) = \log \frac{p(F_i = f_i \mid R)}{p(F_i = f_i \mid \bar{R})} \]

The greatest support scores for occurring features are shown in Table 1, when the classifier has been trained to perform PG07 retrieval. For computational efficiency, the non-occurrence support scores, Y(Fi = 0), are simplified to a base score (of an article with no features) and a small adjustment for each feature that occurs. We estimate the prior probability of relevance P(R) using the number of training examples divided by the number of articles in Medline, and the classifier predicts relevance for articles with S(f) ≥ 0. The prior and minimum score for predicting relevance may also be set on the web interface.

Estimation of feature frequencies

We use posterior estimates for p(Fi = fi|R) and p(Fi = fi|R̄) [estimation equations not preserved in this copy], and similarly for p(Fi = 1|R̄).

Data structures enabling fast classification

MScanner's classification speed is due to the use of a Bayesian classifier, a compact feature space, and a customised implementation. Training in retrieval tasks is made much faster by keeping track of the total number of occurrences of each term in Medline. The MeSH and ISSN feature spaces fit in 16-bit feature IDs, and each Medline record has an average of 13.5 features. Including some overhead, this allows the features of all 16 million articles in Medline to be stored in a binary stream of around 600 MB. A C program takes 32 seconds to parse this file and calculate article scores for all of Medline, returning those above the specified threshold. The rest of the program is written in Python [38], using the Numpy library for vector operations. Source code is provided in Additional File 3.

For storing complete Medline records, we used a 22 GB Berkeley DB indexed by PubMed ID.
It was generated by parsing the Medline Baseline [39] distribution, which consists of 70 GB XML compressed to 7 GB and split into files of 30,000 records each. During parsing, a count of the number of occurrences of each feature in Medline is maintained, ready to be used for training the classifier. To look up feature vectors in cross validation, we use a 1.3 GB Berkeley DB instead of the binary stream.

Construction of PG07, AIDSBio, Radiology and Medline100K

The PG07, AIDSBio and Radiology corpora provided in Additional File 4 are from different domains and are of different sizes, to illustrate the different use cases mentioned in the results. The PG07 corpus comprises literature annotations taken from the PharmGKB [8] on 5 February 2007. The AIDSBio corpus is the intersection of the PubMed AIDS [40] and Bioethics [41] subsets on 19 October 2006. The Radiology corpus is a bibliography of 67 radiology articles focusing on the spleen, obtained from a co-worker of DR's. The corpora exclude records that do not have status "MEDLINE", and thus lack MeSH terms. The Medline100K corpus consists of 100,000 randomly selected Medline records, with completion dates up to 21 January 2007, which is also the upper date for the Control corpus of 10,000 random citations. The size of Medline100K was chosen to provide a good approximation of the Medline background, while containing few unknown relevant articles.

Availability and requirements

• Project Name: MScanner
• Home Page: http://mscanner.stanford.edu
• Operating Systems: Platform independent
• Programming Languages: Python, JavaScript, C
• Minimum Requirements: Internet Explorer 7, Mozilla Firefox 2, Opera 9, or Safari 3
• License: GNU General Public License

Authors' contributions

GP and CS in collaboration with DR and RA conceived of the goals for MScanner, including a web interface and refining the classifier formulation.
GP programmed the MScanner software and web interface, developed and carried out experiments to analyse MScanner's performance with feedback from CS, DR and RA, and wrote the manuscript drafts. CS supervised the research and reviewed all drafts of the manuscript. All authors read and approved the final draft of the paper.

Additional file 1

11-point precision-recall curves. 11pointcurves.pdf is a PDF file containing a table of 11-point interpolated precision curves for all experiments in the paper. The interpolated precision at a specified recall is the highest precision found for any value of recall greater than or equal to the specified recall. (44K, pdf)

Additional file 2

Corpora used in the IEDB comparison. iedb.zip is a ZIP archive containing text files, where each line contains the PubMed ID and completion date of a Medline record. iedb-all-relevant.txt and iedb-all-irrelevant.txt are the relevant and irrelevant cross validation corpora used in the IEDB cross validation. iedb-pre2004-relevant.txt contains the relevant training examples for the retrieval comparison. iedb-2004-relevant.txt and iedb-2004-irrelevant.txt are the manually evaluated IEDB query results from 2004 Medline. PubMed IDs for 2004 Medline may be obtained using the PubMed query 2004 [DateCompleted] AND medline [sb]. (107K, zip)

Additional file 3

Source code for MScanner. mscanner-20071123.zip is a ZIP archive containing the Python 2.5 source code for MScanner, licensed under the GNU General Public License. It also contains API documentation in HTML format. Updated versions will be made available at http://mscanner.stanford.edu. (909K, zip)

Additional file 4

Sample cross validation corpora. corpora.zip is a ZIP archive containing text files for the PG07, AIDSBio, Radiology, Control and Medline100K sample corpora. Each line contains the PubMed ID and completion date of a Medline record.
(442K, zip)

Acknowledgements

This work is supported by the University of Cape Town (UCT), the South African National Research Foundation (NRF), the National Bioinformatics Network (NBN), and the Stanford-South Africa Bio-Medical Informatics Programme (SSABMI), which is funded through US National Institutes of Health Fogarty International Center Grant D43 TW06993, and PharmGKB associates by grant NIH U01GM61374. Thank you to Tina Zhou for setting up the server space for MScanner, and Prof. Vladimir Bajic for a helpful discussion.

References