JAMIA - The Journal of the American Medical Informatics Association
J Am Med Inform Assoc. 2012 Jun; 19(e1): e149–e156.
Published online 2012 Apr 4. doi:  10.1136/amiajnl-2011-000744
PMCID: PMC3392861
FOCUS on clinical research informatics

Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis



To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources.


Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data.


For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms.


The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.

Background and significance

Natural language processing (NLP) is crucial to clinical informatics because the summative information that is stored in millions of clinical notes is too massive to be processed by a human. But automatic methods of processing clinical text have their own challenges, such as the extensive use of specialised medical terms. The Unified Medical Language System (UMLS) Metathesaurus1 has over 8 million strings that an NLP system might consider relevant in clinical text. It is thus common practice for NLP systems1 2 to filter the desired terms by criteria such as lexical redundancy and term ambiguity2 or semantic type.3 Such filters, while reasonable, are uninformed by how the terms behave in clinical text.

The long-term goal of this work is to produce an agile information extraction user interface that allows users to specify terms, concepts and logic relevant to their own problem settings, based on criteria such as frequency, source terminology, syntax and semantic type. To that end, our objective here is twofold: first, to analyse empirical instances of UMLS term strings in a large clinical corpus; and second, to illustrate what types of term characteristics are generalisable across data sources. The resulting statistics and principles may then be used in user-directed filtering of lexicons (eg, using Lexicon Builder4) for practical clinical NLP systems. This may also improve system efficiency—the full Metathesaurus (prohibitively, for some users) requires several gigabytes of memory to serve as a lexicon for many algorithms.

This paper therefore explores the characteristics of Metathesaurus term matches in clinical text along dimensions such as term length, term frequency, source terminology, syntactic category and semantic group. The data source used is a corpus of over 51 million patient notes gathered over a 10-year period at the Mayo Clinic. A variant of the standard Aho-Corasick string matching algorithm5 6 is run on the data to find term matches, and these data are paired against existing information from Mayo's enterprise NLP system, Clinical Notes Indexing (CNI),7 a precursor to Mayo's open-source NLP system, cTAKES.3 The paper also examines the transferability of corpus statistics by applying a set of Mayo-based filtering parameters to the i2b2/VA NLP Challenge corpus.8 This cross-institutional test provides some insight on which statistical metrics are mainly beneficial within one setting and which are broadly applicable.

After a brief discussion on related work, the remainder of this article introduces the data and methods for empirical term matching in clinical corpora, analyses the Mayo Clinic corpus of clinical notes, applies and analyses a practical set of filters and draws a few conclusions for NLP tasks.

Related work

The UMLS Metathesaurus1 is constantly growing as its source terminologies grow; its 2011AA release contains 155 sources with 8 335 125 different strings for terms in 21 languages, and 2 404 937 different concept unique identifiers. As a thesaurus, the Metathesaurus is designed to match identical concepts from different source terminologies, and it has thus been used frequently as a normalisation target for NLP methods.3 7 9–11 Our previous work has analysed the large-scale distribution of UMLS clinical concepts.12

The Metathesaurus has also been commonly used as a lexicon2 to supply term strings that might be identified in clinical text, which is slightly different than the concept-oriented focus for which it was designed. This incongruency has been addressed to some degree in MetaMap, an NLP system from the National Library of Medicine, which allows some configurable filtering of the lexicon.2 13 This filtering is helpful, but lacks the ability to provide a user with in-domain, empirically based recommendations. With the rise in computational power and the increasing availability of biomedical ontologies, we believe that a corpus-driven approach14 is feasible for principled lexicon filtering.

Constructing practical string-oriented lexicons through filtering has been attempted via statistical models and via rule-based systems. Statistical models typically identify a number of properties that allow prediction of the likelihood of a given string being found or not found in a corpus.15 An excellent recent rule-based study by Hettne et al16 recommends applying five rewrite rules (of nine studied) and seven suppression rules (of eight studied) to the UMLS before it is used for biomedical term identification in MEDLINE.16 Our work complements these attempts by highlighting the large-scale effects of the lexicon-building technique of term suppression.

In the biomedical literature domain, the efforts at lexicon creation are quite advanced; for example, the BioLexicon gathers terms from existing data resources into a single, unified repository, and augments them with new term variants extracted from biomedical literature.17 Efforts by Baral et al provide an online dictionary of diseases and drugs based on frequency analysis in Medline (http://bioai4core.fulton.asu.edu/snpshot/download.html). Our work in analysing a large-scale clinical corpus provides a principled foundation for creating such resources in the clinical domain.

Other corpus studies have been conducted which analyse variability in subdomains,18 sections of a document,19 large-scale semantic characteristics of biomedical literature abstracts,20 and longitudinal semantic shift.21 Our previous work also includes comparisons between concepts in the clinical and biomedical domains.12 Here, we undertake the first known enterprise-scale exploration of clinical text that centres on term strings actually present in the text.

Data and methods

Data sources

The data source for the corpus analysis of clinical text was Mayo Clinic clinical notes between 1 January 2001 and 31 December 2010, retrieved from Mayo's Enterprise Data Trust (EDT).22 The EDT stores structured data, unstructured text and CNI-produced annotations7 from a comprehensive snapshot of Mayo Clinic's service areas, excluding only microbiology, radiology, ophthalmology and surgical reports. Additionally, each possible note type at Mayo was represented: clinical note, hospital summary, post-procedure note, procedure note, progress note, tertiary trauma and transfer note.

For the evaluation of a sample filter, the i2b2/VA 2010 NLP Challenge data8 were used. This corpus contained a total of 871 manually annotated, de-identified reports from Partners Healthcare, Beth Israel Deaconess Medical Center and the University of Pittsburgh Medical Center. The majority of notes were discharge summaries, but the University of Pittsburgh Medical Center also contributed progress reports.

String matching algorithm

Our string matching procedure implemented a modified Aho-Corasick algorithm.5 This algorithm takes a dictionary and constructs a finite state machine with efficient transitions between alphabet string states for failed matches. Our modification uses normalised words as the alphabet, but we store the original strings for each match and report results on exact matches.
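The idea of running Aho-Corasick over words rather than characters can be illustrated with a minimal sketch. This is not the authors' implementation: the class name WordAhoCorasick, the normalise helper (lowercasing and stripping surrounding punctuation) and the whitespace tokenisation are all assumptions made for illustration. The sketch builds a trie keyed on normalised words, adds failure links in breadth-first order, and returns the original surface string for each match.

```python
from collections import deque


def normalise(word):
    """Hypothetical normaliser: lowercase and strip surrounding punctuation."""
    return word.lower().strip(".,;:()[]")


class WordAhoCorasick:
    """Aho-Corasick automaton whose alphabet is normalised words, not characters."""

    def __init__(self, terms):
        self.goto = [{}]   # goto[state][word] -> next state
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # lengths (in words) of terms ending at each state
        for term in terms:
            self._add(term)
        self._build_failure_links()

    def _add(self, term):
        words = [normalise(w) for w in term.split()]
        if not words:
            return
        state = 0
        for w in words:
            if w not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][w] = len(self.goto) - 1
            state = self.goto[state][w]
        self.out[state].append(len(words))

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for w, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and w not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(w, 0)
                # Inherit shorter terms that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, text):
        """Return (start_word_index, original_string) for every term occurrence."""
        tokens = text.split()
        state, matches = 0, []
        for i, tok in enumerate(tokens):
            w = normalise(tok)
            while state and w not in self.goto[state]:
                state = self.fail[state]  # follow failure links on a mismatch
            state = self.goto[state].get(w, 0)
            for n_words in self.out[state]:
                start = i - n_words + 1
                matches.append((start, " ".join(tokens[start:i + 1])))
        return matches


matcher = WordAhoCorasick(["chest pain", "pain", "Heart failure"])
found = matcher.find("Patient denies Chest pain. History of heart failure")
print(found)
```

Because matching runs on normalised words while the reported string is cut from the original tokens, nested terms such as 'pain' inside 'chest pain' are both reported with their surface form.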

We used the UMLS Metathesaurus as a lexicon. Due to computational constraints we filtered out entries with 10 or more words and those that were not between 3 and 100 characters. Because the algorithm used the UMLS Metathesaurus there were concept unique identifiers available for each string match. We used this normalised representation to find type unique identifiers and characterise the semantic types of the strings.

Data collection and preparation

For corpus analysis, we retrieved text documents from the EDT repository, with 51 945 627 documents represented from 2000 to 2010. The dictionary lookup procedure described above found any UMLS terms in the text documents. For analysis by syntactic category, we retrieved CNI-produced syntactic chunks7 for the same set of documents, and the dictionary lookup procedure was applied to the text of these chunks. This yielded the syntactic category for the majority of term occurrences in the text.

For the last step of examining the cross-institutional transferability of statistics, we used the 2010 i2b2/VA NLP Challenge data without modification. As above, the dictionary lookup procedure mapped UMLS terms in the i2b2/VA data.

Results and analysis

Corpus analysis

Aggregate characteristics

In the corpus of 51 945 627 clinical documents, there are a total of 2 319 010 575 case-insensitive exact term matches, drawing from 296 167 unique terms. This amounts to 44.64 matches per document on average and only utilises 3.56% of the available case-insensitive terms in the UMLS. It is thus clear that we do not need to search the full Metathesaurus in the course of a concept mapping procedure.
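The aggregate figures above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this paragraph; the implied size of the case-insensitive UMLS lexicon is a back-of-envelope value derived from the rounded 3.56% figure, not a number reported by the study.

```python
documents = 51_945_627       # clinical documents in the Mayo corpus
matches = 2_319_010_575      # case-insensitive exact term matches
matched_terms = 296_167      # unique terms with at least one match

matches_per_document = matches / documents
print(round(matches_per_document, 2))

# Back-of-envelope only: the lexicon size implied by the rounded 3.56% utilisation.
implied_lexicon_size = matched_terms / 0.0356
print(round(implied_lexicon_size / 1e6, 1), "million case-insensitive terms (approx.)")
```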

However, we should not overestimate how much the terminologies may be filtered, as the dictionary lookup algorithm used was fairly unsophisticated. In fact, it is unlikely that there are so few terms per document in clinical text. Xu et al report 19 million Medline abstracts to have 530.45 matched terms per document using 13% of the unique strings in the UMLS.20 This difference is particularly stark in light of the fact that the clinical documents have, on average, three times as many characters (about 2500) as biomedical abstracts.

The larger number of biomedical matches is likely indicative of the fact that the biomedical text covers a broader range of topics than clinical text. It is also difficult for exact dictionary matches to fully capture the range of synonymous expressions, abbreviations and misspellings that are found in clinical text. For example, the strings ‘dispo’ (abbreviation for the disposition of a patient) and ‘00Cardiac implant’ (tokenisation problems) both occur in the Mayo corpus but are not identifiable.

All of these factors point to a large difference between the clinical and biomedical domains, and also to the need for a clinical data-specific study such as this one.

Word and character statistics

As previously mentioned, the UMLS Metathesaurus was designed as a controlled thesaurus not a lexicon. It therefore contains concepts that include an excessive number of words or characters and are not of use to NLP techniques. Figure 1 shows histograms for the number of words in the UMLS and in the subset that is empirically found in Mayo Clinic data.

Figure 1
The number of words in a term versus relative frequency of Unified Medical Language System (UMLS) terms with that number of words.

It should be clear that the mappable dictionary terms from the UMLS are shorter on average than the full set of UMLS terms. Subsetting to these 296 167 terms reduces the average characters per term from 37.27 to 17.83 and average words per term from 4.80 to 2.41, similar to the characteristics reported in the biomedical domain. The same is seen to be true when examining the number of characters in UMLS terms, as in figure 2.

Figure 2
The number of characters in a term versus how many Unified Medical Language System (UMLS) terms had that number of characters.

These findings suggest that filtering out high word counts or character counts may be a safe way to remove unnecessary terms from a lexicon.

Term frequency and TF−IDF

To understand what types of UMLS strings are found in clinical text, we now consider some traditional metrics for the importance of a term. Figure 3 shows the distribution of the top 5000 term frequencies in each domain.

Figure 3
Distribution of the most frequent terms in clinical versus biomedical data.

We have scaled the y-axis for biomedical term frequencies to be comparable with the clinical domain. The x-axis is ordered by term frequency (tf) ranking, where the top strings are seen in table 1A,B. We can see that few terms are used frequently (the left portion of figure 3) and many terms are used infrequently (the bottom/right portion), and this characteristic is consistent across both domains. This is reminiscent of Zipf's Law, which describes the empirical frequency distribution of words in general language as having a large peak and a heavy, one-sided tail. By the technical log–log plot definition of a Zipfian distribution, we would see that this is near-Zipfian but the tail is not as heavy.
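Under the log-log definition, an ideal Zipfian distribution appears as a straight line with slope close to -1 when log frequency is plotted against log rank. One simple way to check this is to fit that slope by least squares, as in the sketch below. The function name loglog_slope and the synthetic frequencies are illustrative assumptions and are not drawn from the Mayo or Medline data.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic, perfectly Zipfian frequencies: f(rank) = 1000 / rank.
ideal = [1000.0 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal)
print(round(slope, 6))
```

A fitted slope steeper than -1 on real term counts would correspond to a lighter tail than the ideal Zipfian case.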

Table 1
Top terms in clinical text (Mayo corpus) and biomedical text (Medline 2011), by term frequency

From table 1 it is evident that in both domains, the most frequent terms are general rather than specific, and reflect the domains from which they arise. In 51 million documents, 7.7% of terms only occurred once; the 0%, 25%, 50%, 75% and 100% quantiles are at 1, 3, 18, 85 and 38 434 437 occurrences, respectively.

We additionally obtained the tf−idf weight of each term for the clinical corpus as in table 2. Tf−idf weights are defined by tf−idf = tf × log(N/df), where tf is the term frequency, N is the number of documents in the corpus and df is the number of documents a term occurs in. They are commonly used in information retrieval to measure the importance of terms, with the intuition that terms that occur often in every document are less distinctive than those that occur often in a few documents. Note that the top terms are very similar to the term frequency-ranked versions.
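As a minimal sketch of the weight defined above, the following computes a corpus-level tf−idf for each matched term. It assumes that tf is the total number of occurrences across the corpus and that the logarithm is natural; the paper does not state the base. The function name corpus_tf_idf and the toy documents are hypothetical.

```python
import math
from collections import Counter


def corpus_tf_idf(docs_terms):
    """docs_terms: one list of matched term strings per document."""
    n_docs = len(docs_terms)
    tf = Counter()  # total occurrences of each term in the corpus
    df = Counter()  # number of documents containing each term
    for terms in docs_terms:
        tf.update(terms)
        df.update(set(terms))
    # Natural log is an assumption; another base only rescales the weights.
    return {term: tf[term] * math.log(n_docs / df[term]) for term in tf}


toy_docs = [
    ["patient", "pain"],
    ["patient"],
    ["patient", "pain", "pain"],
    ["patient", "biopsy"],
]
weights = corpus_tf_idf(toy_docs)
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 3))
```

In this toy example a term that appears in every document receives a weight of zero, whatever its frequency, which is the behaviour the intuition above describes.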

Table 2
Top terms in clinical text by tf–idf weight

Figure 4 visualises this comparison by showing the tf rank (x-axis) with the tf−idf values (y-axis)—they are still highly consistent. From here, we see that traditional information retrieval metrics such as tf−idf may be somewhat limited in their ability to discover truly valuable, discriminative words in the clinical domain.

Figure 4
Tf−idf values of the most frequent terms in clinical data.

This ineffectiveness of inverse document frequency is likely due to the fact that the clinical domain is highly specialised by note type and subdomain. The term ‘patient’ is discriminative in some respects: it can be easily found in progress notes and discharge summaries, but is much less likely to be found in notes like pathology or radiology reports.

Source terminology

Here, we compare the number of strings per terminology in the raw UMLS (table 3A) with the most commonly used terminologies (by number of terms represented) in the clinical and biomedical domains (table 3B,C).

Table 3
Top source vocabularies and their degree of utilisation, by number of unique term strings in clinical notes

These tables show which terminologies are best for each domain, ranked by the number of unique case-insensitive terms used. Tables 3B and 3C also include what percentage of the terms in the full terminology are used. Interestingly, the new Consumer Health Vocabulary contains only 148 383 terms but accomplishes excellent coverage of terms in both domains because it was designed for natural language contexts. The Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) is the largest source ontology in the UMLS and was developed specifically as a clinical resource. As such, it is one of the most important terminologies in the clinical domain. Similarly, Medical Subject Headings (MSH) was developed specifically for indexing biomedical literature and therefore captures the most terms from biomedical abstracts.

The percentage usage of each of these ontologies is lower in the clinical domain than in the biomedical domain, again likely due to applying an exact case-insensitive string match to highly varied clinical notes. Low usage rates in the clinical domain, for example, SNOMED-CT, also indicate that the resource may best contribute to a lexicon after some filtering along other dimensions.

Semantic groups

As mentioned above, the frequent words in the clinical domain differ from those in the biomedical domain. This is most easily seen in figure 5A,B.

Figure 5
(A) Frequencies of terms discovered in clinical versus biomedical text, by semantic group; (B) number of unique terms, by semantic group.

The percentages of matched strings are compared by semantic group and they differ greatly. Here, we follow Bodenreider and McCray's 15 semantic groups23 of semantic types (UMLS Type Unique Identifiers).

Figure 6
Percentage of unique terms that are noun phrase (NP) dominated, by semantic group.

These plots display predictable domain differences in semantic type distribution of terms. Clinical data focus on disorders, anatomy, medications and procedures. cTAKES and CNI are examples of intentional semantic type-based filtering for clinically relevant types, in which five semantic groups are kept, accomplishing 59.60% coverage of occurrences and 82.74% coverage of unique strings.

Note that the difference between the clinical and biomedical domains is very significant. Type filters designed for one domain should not be applied to another, though some semantic groups are relatively infrequent in both domains.

Syntactic categories

Across the Mayo clinical notes in this study, 90.18% of clinical term mentions were found in noun phrase (NP) chunks; Xu et al found similar NP-dominance characteristics in biomedical data. Figure 6 stratifies the clinical NP-dominance characteristics by semantic group. While filtering out non-NP constructions is commonplace in many clinical NLP systems, it should be done with caution for semantic groups like “Procedures” or “Activities & Behaviors”.

It should be noted that this depends on a sound chunking procedure, and there were some limitations to the accuracy of the IBM shallow parser in CNI: there were terms that resided in incorrect chunks and those that were not in any chunk. However, as string-matched terms occur across the whole distribution of the text, this noise is overcome on average.

Cross-institutional analysis

Based on the corpus analysis on Mayo data above, we defined an example configuration of filters for use-case agnostic information extraction in clinical notes, and applied these candidate filters to string-matched i2b2/VA data to examine their trans-institutional applicability.

A Mayo-based filtering configuration

We implemented eight lexicon filters:

  1. Special characters. The UMLS contains fine-grained semantic distinctions that are indicated with punctuation, for example, ‘[D] Respiratory insufficiency (finding)’ versus ‘Respiratory insufficiency, NOS.’ This UMLS-intrinsic filter removes a term from the lexicon if and only if it begins with ‘[’, ends with ‘)’ or contains a comma.20
  2. Maximum number of words. Given the histogram in figure 1, fewer than 1000 terms have seven words. Thus, we eliminate terms with seven or more words, removing over a quarter of UMLS terms.
  3. Maximum number of characters. Given the histogram in figure 2, only 39 terms have 56 or more characters. We thus eliminate terms with fewer than 2 characters or more than 55 characters, removing over a fifth of UMLS terms.
  4. Language. Fifteen languages are represented in the UMLS. Filtering to English terms reduces the set of UMLS terms by almost a third.
  5. Source terminology. Many UMLS source terminologies are not designed to be lexicons (eg, International Classification of Diseases, ninth revision billing codes). We keep only the top 14 English sources out of the possible 155: SNOMED-CT, Consumer Health Vocabulary, National Cancer Institute (NCI) Thesaurus, Medical Subject Headings (MSH), Read Codes, Medical Dictionary for Regulatory Activities Terminology (MedDRA), SNOMED International, MEDCIN, UMLS Metathesaurus, National Drug File—Reference Terminology (NDF-RT), the original SNOMED, Online Mendelian Inheritance in Man (OMIM), Logical Observation Identifiers Names and Codes (LOINC) and Computer Retrieval of Information on Scientific Projects (CRISP) Thesaurus.
  6. Semantic group. Of the 15 semantic groups, over 92% of Mayo Clinic terms come from only 7: anatomy, chemicals & drugs, concepts & ideas, disorders, living beings, physiology, and procedures.
  7. Empirical occurrence filter. We filter out those terms that never appeared in the Mayo corpus. This leaves the full set of Mayo Clinic term occurrences and tests the transferability of a specific lexicon across institutions.
  8. Term frequency. A total of 99.99% of mentions can be retained if we eliminate terms that occurred only once or twice in the Mayo corpus. This is a subset of the empirical occurrences filter, since zero occurrences are also eliminated.
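The UMLS-intrinsic filters (1–4) and the term frequency filter (8) in the list above can be sketched as simple predicates. This is an illustrative reading of the list, not the authors' code: filter 1 is interpreted as a disjunction of its three conditions, the language code 'ENG' follows the UMLS convention for English, and the function names, parameters and toy counts are hypothetical.

```python
def passes_intrinsic_filters(term, language="ENG",
                             max_words=6, min_chars=2, max_chars=55):
    """Filters 1-4: special characters, word count, character count, language."""
    # Filter 1 (read here as a disjunction): leading '[', trailing ')' or any comma.
    if term.startswith("[") or term.endswith(")") or "," in term:
        return False
    # Filter 2: drop terms with seven or more words.
    if len(term.split()) > max_words:
        return False
    # Filter 3: drop terms with fewer than 2 or more than 55 characters.
    if not (min_chars <= len(term) <= max_chars):
        return False
    # Filter 4: keep English terms only.
    return language == "ENG"


def passes_frequency_filter(term, corpus_counts, min_count=3):
    """Filter 8: drop terms seen zero, one or two times in the reference corpus."""
    return corpus_counts.get(term, 0) >= min_count


# Toy counts for illustration only; not values from the Mayo corpus.
toy_counts = {"respiratory insufficiency": 120, "rare term": 2}
candidates = [
    "respiratory insufficiency",
    "[D] Respiratory insufficiency (finding)",
    "Respiratory insufficiency, NOS",
    "rare term",
]
kept = [t for t in candidates
        if passes_intrinsic_filters(t) and passes_frequency_filter(t, toy_counts)]
print(kept)
```

Keeping the intrinsic predicates separate from the corpus-dependent one mirrors the distinction the list draws between filters that depend only on the term string and those that depend on a particular corpus.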

Cross-institutional filtering evaluation

Table 4 reports the impact of this filtering. First, we begin with a baseline of the full UMLS. The top left cell indicates the number of unique UMLS terms. Rows show the lexicon size reduction effect of individual filters against this baseline. The final rows apply multiple filters at once.

Table 4
Transferability of corpus-based filtering of the Unified Medical Language System (UMLS)

The left ‘UMLS’ columns analyse how much of the UMLS Metathesaurus remains after each of the filters, and larger percent reduction values correspond to more memory-efficient systems. The middle ‘Mayo’ columns evaluate the reasoning for choosing these filter definitions. For example, our semantic groups filter (filter 6 in table 4) uses only seven semantic groups. Reading the row from left to right, it reduces the size of the lexicon to 7 798 937 (a 6.43% reduction), keeps 273 300 of the 296 798 unique terms (ie, excludes 7.92%), and keeps 2.289×10^9 of the 2.376×10^9 term occurrences (ie, excludes 3.68%) for the Mayo corpus. As a whole, the filters defined in this example might be reasonable for some information extraction applications, excluding only 5.57% of all mentions. The right ‘i2b2/VA’ columns are defined by using Mayo-based filters on term matches from the i2b2/VA corpus.
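The percentages in the worked row above follow from a single formula, excluded = (total − kept) / total. The short sketch below reproduces two of them. It assumes the baseline lexicon size is the 8 335 125 strings quoted for the 2011AA release in the Related work section; that value is consistent with the reported 6.43% reduction but is not read directly from table 4.

```python
def percent_excluded(kept, total):
    """Percentage of items removed when `kept` of `total` survive a filter."""
    return 100.0 * (total - kept) / total


# Assumed baseline: the 2011AA string count quoted in Related work.
umls_total = 8_335_125

lexicon_reduction = percent_excluded(7_798_937, umls_total)
unique_terms_excluded = percent_excluded(273_300, 296_798)

print(round(lexicon_reduction, 2))       # 6.43
print(round(unique_terms_excluded, 2))   # 7.92
```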

Our cross-institutional evaluation lies in comparing the ‘Mayo’ columns with the ‘i2b2/VA’ columns. Filters 1–4 seem to apply similarly and accurately across the two corpora. This is to be expected because they largely deal with systematic intrinsic properties of the term strings in the UMLS and should not depend on corpora. The remaining filters differ between Mayo and i2b2/VA data, indicating that statistical analysis along those lines should only be transferred across data sources with caution.

The source terminology filter removed a far smaller proportion of unique terms in the i2b2/VA corpus (6.14%) than in the Mayo corpus (15.31%). This is probably due to the vast size difference between the two corpora: recalling figure 3, a heavy tail distribution within large corpora means that many uncommon terms are mapped in the Mayo corpus, but not in the i2b2/VA corpus. We may conclude that filtering by source reduces the diversity of available terms, but the most frequent terms are captured in a small number of sources.

In i2b2/VA data, the semantic group filter excludes a higher proportion of unique terms but a smaller proportion of term mentions than in Mayo data. The variability is not great, however, compared with the differences with the biomedical literature domain in figure 5A,B. We conclude that though different clinical corpora may have slightly different distributions, their utilised terms are still relatively similar to each other in semantic groups.

Perhaps most instructive are filters 7–8, the empirical occurrences and the term frequency filters. Both of these filters exclude a smaller proportion of unique terms in the i2b2/VA data (1.39% for filter 8) than in the Mayo data (23.62% for filter 8), again likely due to the corpus size differential. However, a larger proportion of i2b2/VA mentions are excluded (15.23% for filter 8) than Mayo mentions (0% for filter 8). Despite the fact that these are both clinical corpora, term frequencies have vastly different characteristics in the different corpora. Although these statistics are standard NLP techniques, they would appear to be more helpful within an institution than across sources of data.


The foregoing cross-institutional test aligns with envisioned applications because a user of a practical application like Lexicon Builder4 will need guidance on what filters to choose. It would be safe in such a situation to apply filters that have been validated across institutions, but other filters should be applied with caution. A final recommendation for use-case agnostic information extraction in the i2b2/VA corpus, as presented in table 4, might be to utilise filters 1–5. These simple filters achieve fivefold reduction in lexicon size (efficiency) while preserving almost 94% of unique terms and almost 98% of mentions in the corpus.

The semantic group filter has limited utility because it does not greatly reduce the dictionary size. However, other factors, such as the limitations of a corresponding human annotation effort, may be reasons for narrowing the scope of general information extraction to specific semantic groups.

Most importantly, the results show that the empirical occurrences and term frequency filters are highly institution specific. Any methodologies developed from these statistics should take care to complete a preliminary corpus analysis rather than directly using the Mayo Clinic statistics.

Unlike our previous work,12 the preceding analysis does not attempt to calculate or analyse concept-level semantics. Although mixing the two analyses is an interesting problem, our term-level analysis is natural for the envisioned problem setting, where a user is building a lexicon of strings for concept indexing—concept normalisation would presumably be a downstream task. Additionally, we do not calculate the ‘usefulness’ of filters in real-world applications because such measures typically require a concept-centric focus.

Conclusion and future work

Based on the occurrences of terms in a 51 million document corpus of Mayo Clinic clinical notes, this paper has presented a suite of statistics on UMLS term occurrences in the clinical domain, and has evaluated the cross-institutional applicability of these statistics. We have shown several measures that are intrinsic to their Metathesaurus entries (term well-formedness, length and language) that generalise easily across clinical institutions. Term frequencies are highly variable across institutions and should be adapted across domains or institutions with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but the distance between institutions is much smaller than that between the clinical and biomedical literature domains.

We believe this analysis makes it possible for end users to build customised, empirically informed lexicons from the UMLS. In terms of implementation, we plan to enhance Lexicon Builder4 with the statistics presented above. Other future work includes the further characterisation of clinical note sections (eg, terms may differ in history of present illness vs discharge diagnosis sections), types of notes (eg, discharge summaries vs operative reports), co-occurrence information (ie, utilising latent semantic information), and ontological structure (eg, which branches in an ontology are more useful).

As mentioned, a concept-centric analysis and its relationship to our term-centric analysis are also areas of future work. A concept-centric filtering evaluation, for example, may actually show that precision could be improved by filtering, since it could remove ‘distracting’ terms.

While the coverage of lexicons derived out of biomedical ontologies is impressive, clinical writing contains many more variants. We plan to generate accurate variants by analysing lexical variants, synonyms and related terms at a large scale.


The authors would like to acknowledge Rong Xu, Vinod Kaggal and Yipei Liu for their help on the experiments and statistics, and the anonymous reviewers for their thorough feedback.


Contributed by

Contributors: SW carried out the experiments, led the study design and analysis and drafted the manuscript. HL and DL helped with coding the experiments and with manuscript drafting. NS and CT enabled the comparisons with biomedical data, and NS also helped draft the manuscript. MM and CC provided institutional support and manuscript editing.

Funding: This work was supported in part by the NIH Roadmap Grant U54 HG004028. This study was also supported by National Science Foundation ABI:0845523, National Institutes of Health R01LM009959A1, and the SHARPn (Strategic Health IT Advanced Research Projects) Area 4: Secondary Use of EHR Data Cooperative Agreement from the HHS Office of the National Coordinator, Washington, DC, DHHS 90TR000201.

Competing interests: None.

Ethics approval: Ethics approval was provided by Mayo Clinic Institutional Review Board.

Provenance and peer review: Not commissioned; externally peer reviewed.

Data sharing statement: The aggregate corpus statistics in this paper will be released open-source at a future date as a part of the Lexicon Builder web service.


1. Lindberg DA, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med 1993;32:281.
2. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17:229–36
3. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13
4. Parai GK, Jonquet C, Xu R, et al. The Lexicon Builder Web service: building custom lexicons from two hundred biomedical ontologies. AMIA Annu Symp Proc 2010;2010:587–91
5. Aho AV, Corasick MJ. Efficient string matching: an aid to bibliographic search. Commun ACM 1975;18:333–40
6. Dai M, Shah NH, Xuan W, et al. An efficient solution for mapping free text to ontology terms. AMIA Summit on Translational Bioinformatics, San Francisco, CA, 2008
7. Savova G, Kipper-Schuler K, Buntrock J, et al. UIMA-Based Clinical Information Extraction System. Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP. Proceedings paper, LREC (Languages Resources and Evaluation Conference), Marrakech, Morocco, 2008:39
8. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6
9. Aronson A. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp 2001:17–21
10. Nadkarni P, Chen R, Brandt C. UMLS concept indexing for production databases: a feasibility study. J Am Med Inform Assoc 2001;8:80–91
11. Denny JC, Smithers JD, Miller RA, et al. ‘Understanding’ medical school curriculum content using KnowledgeMap. J Am Med Inform Assoc 2003;10:351–62
12. Wu S, Liu H. Semantic characteristics of NLP-extracted concepts in clinical notes vs. biomedical literature. Annual Symposium of American Medical Informatics Association, Washington DC, WA, 2011
13. Aronson AR. Filtering the UMLS Metathesaurus for MetaMap. National Library of Medicine Technical Report. 2006
14. Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. Intelligent Systems IEEE 2009;24:8–12
15. McCray AT, Bodenreider O, Malley JD, et al. Evaluating UMLS strings for natural language processing. Proc AMIA Symp 2001:448.
16. Hettne KM, van Mulligen EM, Schuemie MJ, et al. Rewriting and suppressing UMLS terms for improved biomedical term identification. J Biomed Semantics 2010;1:5.
17. Thompson P, McNaught J, Montemagni S, et al. The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinformatics 2011;12:397.
18. Lippincott T, Seaghdha D, Sun L, et al. Exploring variations across biomedical subdomains. Proceedings of International Conference on Computational Linguistics, Beijing, China, 2010:689–97
19. Cohen KB, Johnson HL, Verspoor K, et al. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics 2010;11:492.
20. Xu R, Musen MA, Shah NH. A comprehensive analysis of five million UMLS metathesaurus terms using eighteen million MEDLINE citations. AMIA Annu Symp Proc 2010;2010:907–11
21. Michel JB, Shen YK, Aiden AP, et al. Quantitative analysis of culture using millions of digitized books. Science 2011;331:176–82
22. Chute CG, Beck SA, Fisk TB, et al. The enterprise data trust at Mayo Clinic: a semantically integrated warehouse of biomedical data. J Am Med Inform Assoc 2010;17:131.
23. Bodenreider O, McCray AT. Exploring semantic groups through visual approaches. J Biomed Inform 2003;36:414–32
