Copyright: This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose.

Overcoming Terminology Barrier Using Web Resources for Cross-Language Medical Information Retrieval

1 Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, ROC
2 Stanford Medical Informatics, Stanford, CA

Abstract
A number of authoritative medical websites, such as PubMed and MedlinePlus, provide consumers with the most up-to-date health information. However, non-English speakers often encounter not only language barriers (from other languages to English) but also terminology barriers (from laypersons’ terms to professional medical terms) when retrieving information from these websites. Our previous work addressed language barriers by developing a multilingual medical thesaurus, the Chinese-English MeSH, while this study presents an approach to overcoming terminology barriers based on Web resources. Our approach uses two techniques: monolingual concept mapping using approximate string matching and crosslingual concept mapping using Web resources. The evaluation shows that our approach can significantly improve performance on MeSH concept mapping and cross-language medical information retrieval.

INTRODUCTION
A number of authoritative medical websites, such as PubMed and MedlinePlus, provide consumers with the most up-to-date health information. However, most of the information is available in English only, so non-English-speaking consumers often have great difficulty accessing information from these English medical websites because of language barriers. Moreover, consumers face additional terminology barriers in finding medical information, because most thesauri for medical information retrieval (e.g., MeSH) are expressed in professional terminology whereas consumers’ queries are often based on laypersons’ terminology.
For example, most Chinese consumers know the Chinese layperson’s term meaning “dementia for aged people” but not the professional medical term for Alzheimer’s Disease. Using the layperson’s term, they cannot retrieve the information about Alzheimer’s Disease that they could retrieve with the Chinese medical term. Overcoming terminology barriers is therefore as crucial as crossing language barriers in information retrieval.

Our previous efforts focused on developing techniques for cross-language medical information retrieval (CLMIR) [1]. A multilingual medical thesaurus plays a crucial role in CLMIR, according to the experience of CliniWeb and other CLMIR systems [2, 3]. MeSH is one of the most significant multilingual medical thesauri, but its traditional Chinese version was unavailable before 2005. In 2005, we employed a Web-based term translation method to semi-automatically compile over 19,000 entries of the Chinese-English MeSH [1]. Based on the Chinese-English MeSH, we developed a prototype cross-language medical metasearch engine called MMODE (Multilingual Medical Online Data Explorer), which helps consumers retrieve top-quality English medical information by using Chinese medical terms. However, many Chinese consumers still cannot find the information they want in MMODE because they cannot formulate their queries in Chinese medical terms. This suggests that terminology barriers are as serious as language barriers in medical information retrieval. Thus, to retrieve English medical information for Chinese consumers, it is critical to handle not only Chinese medical terms but also Chinese laypersons’ terms when mapping Chinese terms to MeSH concepts.

To further improve the CLMIR performance of MMODE, we are developing strategies for dealing with terminology barriers based on the Chinese-English MeSH and Web resources. In this study, we propose two simple but feasible methods: mapping monolingual concepts using approximate string matching and mapping crosslingual concepts using Web search results [4, 5].
METHOD
In this section, we introduce our approach to overcoming terminology barriers in query translation and thereby improving retrieval performance in CLMIR. We use the Chinese-English MeSH and Web search results to map Chinese query terms to Chinese-English MeSH concepts. The first method, monolingual concept mapping, resolves a large proportion of MeSH concept mappings and thus improves CLMIR performance, whereas the second method, crosslingual concept mapping, is intended for cases where the Chinese laypersons’ terms share no common substrings with the professional terms. The two methods are presented below.

Monolingual Concept Mapping Using Approximate String Matching
For each Chinese query term, we can easily find its corresponding English MeSH concept if the query term appears in the Chinese-English MeSH. However, this ideal scenario does not occur often. Many English medical terms in MeSH have multiple corresponding Chinese terms (including laypersons’ terms), and not all of these Chinese terms are included in the Chinese-English MeSH. For example, the English term “adrenoleukodystrophy” has several corresponding Chinese terms, of which only the professional term is included in the Chinese-English MeSH. The professional terms that appear in the Chinese-English MeSH are called “Chinese MeSH concepts” in the following text.

Fortunately, some of the non-professional terms share common substrings with the professional terms that appear in the Chinese-English MeSH. Thus, we adopt the Dice coefficient to calculate string similarity by counting the number of characters that co-occur in both the Chinese query term and the Chinese MeSH concept.
$$\mathrm{Dice}(q, M_c) = \frac{2 N_{qc}}{N_q + N_c}$$
where $N_q$ and $N_c$ are the numbers of characters in the query term $q$ and the Chinese MeSH concept $M_c$, and $N_{qc}$ is the number of characters that occur in both $q$ and $M_c$. We predefined a threshold (0.8) to filter out some incorrect concept mappings.

Crosslingual Concept Mapping Using Web Search Results
The method presented above resolves a large proportion of monolingual concept mappings for query translation. However, a few laypersons’ terms cannot be mapped to Chinese MeSH concepts because they share no common substrings with the professional term. For example, the layperson’s term meaning “progressive freezing” cannot be mapped to the professional term for amyotrophic lateral sclerosis. Mapping this kind of Chinese layperson’s term to Chinese-English MeSH concepts is certainly a challenge. We approach this problem by using crosslingual concept mapping.

According to our observations, many Chinese search-result pages from search engines contain rich snippets of summaries with a mixture of Chinese and English text. Therefore, when we search explicitly for the layperson’s term in Chinese-language pages from Google, the search results are likely to include relevant snippets containing its corresponding English MeSH concept, “amyotrophic lateral sclerosis”. The crosslingual concept mapping technique is based on our previously proposed Web-based term translation method [1] and on the crosslingual concept mapping model introduced in this paper. The procedure is summarized below:
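As a rough illustration only, and not the authors' procedure or implementation, the two mapping ideas described in this section could be sketched in Python as follows. Several things here are assumptions: character overlap is counted as a multiset intersection, the 0.8 threshold is applied as "greater than or equal", crosslingual mapping is reduced to counting which known English MeSH concept appears in the most snippets, and the function names, the toy thesaurus, and the toy snippets are all hypothetical and not taken from the paper.

```python
from collections import Counter

# Toy data for illustration only; NOT taken from the paper.
TOY_MESH = {
    "阿茲海默氏症": "Alzheimer Disease",
    "肌萎縮性脊髓側索硬化症": "Amyotrophic Lateral Sclerosis",
}
TOY_SNIPPETS = [
    "漸凍人 (Amyotrophic Lateral Sclerosis, ALS) 是一種運動神經元疾病",
    "俗稱漸凍人的 amyotrophic lateral sclerosis 患者",
    "漸凍人協會 Home News About",
]


def dice(q: str, c: str) -> float:
    """Character-level Dice coefficient: 2 * N_qc / (N_q + N_c)."""
    if not q or not c:
        return 0.0
    # Assumption: shared characters are counted with multiplicity.
    n_qc = sum((Counter(q) & Counter(c)).values())
    return 2 * n_qc / (len(q) + len(c))


def map_monolingual(query, mesh, threshold=0.8):
    """Return the English concept of the most similar Chinese MeSH term,
    or None if no term reaches the similarity threshold."""
    if query in mesh:  # exact hit in the thesaurus
        return mesh[query]
    best = max(mesh, key=lambda term: dice(query, term), default=None)
    if best is not None and dice(query, best) >= threshold:
        return mesh[best]
    return None


def map_crosslingual(snippets, mesh_concepts):
    """Return the known English MeSH concept that occurs in the most
    search-result snippets, or None if none occurs at all."""
    concepts = list(mesh_concepts)
    counts = Counter()
    for snippet in snippets:
        lowered = snippet.lower()
        for concept in concepts:
            if concept.lower() in lowered:
                counts[concept] += 1
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    # A near-match is resolved by string similarity alone.
    print(map_monolingual("阿茲海默症", TOY_MESH))
    # No shared characters, so fall back to the snippet-based mapping.
    print(map_monolingual("漸凍人", TOY_MESH))
    print(map_crosslingual(TOY_SNIPPETS, TOY_MESH.values()))
```

Restricting candidates to known MeSH concepts also acts as a simple noise filter, since common English words in the snippets are never considered. A real system would obtain the snippets from a search engine, which is deliberately left out of this sketch.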
APPLICATION: CLMIR SYSTEM
To demonstrate the effectiveness of our method, we developed MMODE, a cross-language metasearch engine, to help Chinese consumers retrieve health information, including research literature, popular news, related Web pages, and related images, from medical websites.

Figure 1

The user interface of MMODE is shown in Figure 2.

Figure 2. Example query (Amyotrophic Lateral Sclerosis). The system returned related documents from PubMed and related images from Google.

EVALUATION
Concept Mapping Evaluation
Test sets
We conducted experiments on three test sets.
We adopt coverage rate as the evaluation metric to assess the effectiveness of concept mapping.

Result
Table 2 shows that, with the proposed monolingual concept mapping method, the coverage rates of MeSH concept mapping for the test sets CD-1 and CD-2 reached 68.42% and 57.89%, respectively. For the set CD-3, however, the monolingual concept mapping method obtained only 10%, whereas the crosslingual concept mapping method obtained 70%. On this set, monolingual concept mapping performed poorly, while crosslingual concept mapping yielded a much larger improvement.
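The text does not spell out how coverage rate is computed. One plausible reading, assumed here, is the percentage of test terms that are mapped to a MeSH concept. The sketch below shows that calculation; the function name and the counts used are hypothetical.

```python
def coverage_rate(mapped: int, total: int) -> float:
    """Percentage of test terms that were mapped to a MeSH concept.

    Assumption: coverage = mapped terms / all test terms, as a percentage.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= mapped <= total:
        raise ValueError("mapped must be between 0 and total")
    return 100.0 * mapped / total


if __name__ == "__main__":
    # Hypothetical counts, not the paper's test-set sizes.
    print(f"{coverage_rate(7, 10):.2f}%")
```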
CLMIR Evaluation
Medical text corpus
To determine the effectiveness of our proposed approach in CLMIR, we conducted preliminary experiments on the medical corpus from MuchMore, a parallel corpus composed of 9,000 English-German scientific medical abstracts. We used only the English part of the corpus.

Test queries
MuchMore also provides 25 English-German bilingual test queries with relevance judgments. We had the 25 English queries translated into Chinese by a Chinese physician and used these Chinese queries as test queries.

Experiment procedure
To compare our proposed approach with conventional query translation techniques in CLIR, we also evaluated the performance of CLMIR using a general bilingual dictionary (CEDICT), a medical bilingual dictionary, and the well-known machine translation system SYSTRAN. Performance was evaluated by precision (P), recall (R), and F-measure (F).

Result
The performance of CLMIR using the different methods is summarized in Table 3. The combination of the Chinese-English MeSH and the proposed monolingual and crosslingual concept mapping methods achieved the best performance at 0.234, about 93% (0.234/0.252) of monolingual performance. The conventional query translation techniques (i.e., general dictionary and machine translation) were not effective for CLMIR.
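For readers who want to reproduce the arithmetic, the measures named above can be sketched as follows. This assumes set-based precision and recall per query and the balanced F-measure (F1); the text does not state which F variant was used. The function names and document identifiers are hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f


def relative_performance(cross_language: float, monolingual: float) -> float:
    """Cross-language score as a percentage of the monolingual score."""
    return 100.0 * cross_language / monolingual


if __name__ == "__main__":
    # Hypothetical document IDs for a single query.
    print(precision_recall_f(["d1", "d2", "d3", "d4"], ["d2", "d4", "d5"]))
    # The ratio reported in the Result paragraph: 0.234 / 0.252.
    print(f"{relative_performance(0.234, 0.252):.1f}%")
```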
DISCUSSION
In this paper, we illustrated the challenge of terminology barriers in the context of CLMIR. However, such barriers may exist even in monolingual IR. For example, some English laypersons’ terms, such as “pink eye” (conjunctivitis) and “water pills” (diuretics), are not included in MeSH concepts or entry terms. Thus, it is also a challenge for English-speaking consumers to find relevant information by using these laypersons’ terms.

We proposed an approach to overcoming terminology barriers using Web resources. Because more and more health-related websites contain laypersons’ terms together with their corresponding professional terms, our approach is becoming increasingly applicable. Moreover, the Web is dynamic: Web resources may contain new laypersons’ terms (e.g., new nicknames for diseases) that cannot be added to medical dictionaries or thesauri in a timely manner. The proposed approach can capture these new terms and improve IR performance.

In summary, we proposed two methods: 1) using approximate string matching for monolingual concept mapping and 2) using Web search results for crosslingual concept mapping. While the first method resolves a large proportion of MeSH concept mappings and improves CLMIR performance, the second method is shown to be effective when the Chinese laypersons’ terms share no common substrings with the professional terms. The experiments show that the two methods can significantly improve performance in MeSH concept mapping and CLMIR.

A limitation of the current evaluation is the small size of the evaluation sets. Because the performance of the methods might differ substantially across datasets, we need to evaluate our methods on larger datasets in the future. A challenge encountered in crosslingual concept mapping is the robustness of the term translation method.
A number of incorrect translations are extracted because it is difficult to remove common English terms from search results. The problem can be mitigated by using MeSH as a filter to remove these noisy candidates. Another challenge is mapping crosslingual concepts for low-frequency laypersons’ terms, because they rarely appear in Chinese search results. This problem may ease as the Web grows. In the meantime, we will also investigate new concept mapping methods for these low-frequency terms.

REFERENCES
1. Lu WH, Lin SJ, Chan YC, Chen KH. Semi-Automatic Construction of the Chinese-English MeSH Using Web-Based Term Translation Method. Proc AMIA Symp. 2005:475–9.
2. Hersh WR, Donohoe LC. SAPHIRE International: A Tool for Cross-Language Information Retrieval. Proc AMIA Symp. 1998:673–7.
3. Volk M, Ripplinger B, Vintar S, Buitelaar P, Raileanu D, Sacaleanu B. Semantic annotation for concept-based cross-language medical information retrieval. Int J Med Inform. 2002 Dec;67(1–3):97–112.
4. Cheng PJ, Teng JW, Chen RC, Wang JH, Lu WH, Chien LF. Translating unknown queries with Web corpora for cross-language information retrieval. Proc of ACM SIGIR. 2004:146–153.
5. Kimura F, Maeda A, Yoshikawa M, Uemura S. Cross-Language Information Retrieval based on category matching between language versions of a web directory. Proc of ACL. 2003 Jul:153–160.
6. Brill E. Some Advances in Transformation-Based Part of Speech Tagging. Proc National Conference on Artificial Intelligence. 1994:722–727.