Copyright: This is an Open Access article: verbatim copying and redistribution of
this article are permitted in all media for any purpose.

Semi-Automatic Construction of the Chinese-English MeSH Using Web-Based Term Translation Method

1 Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, ROC
2 Institute of Information Science, Academia Sinica, Taiwan, ROC

Abstract

Due to language barrier, non-English users are unable to retrieve the most
updated medical information from the U.S. authoritative medical websites, such
as PubMed and MedlinePlus. A few cross-language medical information
retrieval (CLMIR) systems have been utilizing MeSH (Medical
Subject Headings) with a multilingual thesaurus to bridge the gap. Unfortunately, MeSH
has not yet been translated into traditional Chinese. We proposed a semi-automatic approach to constructing a Chinese-English MeSH
based on Web-based term translation. The system provides knowledge
engineers with candidate terms mined from anchor texts and search-result
pages. The result is encouraging. Currently, more than 19,000 Chinese-English
MeSH entries have been compiled. This thesaurus will be used
in Chinese-English CLMIR in the future.

INTRODUCTION

A number of Web resources, such as PubMed and MedlinePlus, provide the public and healthcare professionals
with the most up-to-date findings in medicine. Although
access to such top-quality resources is free
and unlimited for users all around the world, most of this information
is available in English only. Non-English users therefore often encounter
a considerable language barrier when trying to access medical information
from these websites. In addition, most non-English consumers are not
familiar with medical terminology even in their first language. This
raises the language barrier even higher in medical information retrieval. For
example, most Chinese people know the Chinese layperson’s
term (dementia for aged people) but not the medical term (Alzheimer Disease). Currently, it is almost impossible for this population
to retrieve the consumer health information they need from MedlinePlus. Thus, matching
Chinese medical terms, especially layperson’s
terms, to English medical terms becomes a critical challenge in
assisting non-English users to find useful medical information. Unfortunately, there
is currently no system providing Chinese-English cross-language
medical information retrieval (CLMIR).

A multilingual medical thesaurus plays a crucial role in CLMIR according
to the experience of CliniWeb [1] and other CLMIR systems [2,3]. However, manual lexicography is time-consuming and not cost-effective. To date,
there is still no effective method for constructing multilingual
medical thesauri automatically. Most existing medical thesauri are
manually built.

We proposed a new method to semi-automatically map Chinese medical terms
to Medical Subject Headings (MeSH) and construct a bilingual medical
thesaurus for Chinese-English CLMIR. MeSH is the most significant medical
thesaurus in English and has been manually translated into many languages. However, a traditional Chinese version of MeSH is still not
available. In this study, we constructed part of a traditional Chinese-English MeSH by
translating English medical terms in MeSH into Chinese using
an integrated Web-based term translation method. In previous work, we
proposed an integrated Web-based method that explores
two kinds of Web resources, i.e., Web anchor text [4,5] and search-result pages [6], to deal effectively with the problems of multilingual translation for
diverse unknown (new) Web query terms.

The present study has two major goals. First, we expect that the proposed
semi-automatic method is able to help knowledge engineers to reduce
manual efforts in the difficult task of compiling Chinese-English MeSH. Second, in
the future, we will utilize the Chinese-English MeSH to
develop a practical cross-language medical meta-search engine that could
assist the laypersons to retrieve top-quality English medical information
by submitting Chinese terms.

BACKGROUND

We first review previous work on automatic monolingual term mapping and
cross-language term translation.

Monolingual term mapping

For monolingual medical information retrieval, laypersons often encounter
the problem that their search terms are not always compatible with the
professional terms in medical documents. A number of studies have focused
on this problem [7,8,9]. Leroy and Chen developed a Medical Concept Mapper to help users
find medical information by providing them with appropriate medical search
terms. However, the problems of cross-language term mapping
have not yet received much attention in the medical domain.

Parallel-corpus-based term translation

In the research area of machine translation, a number of works have
used statistical techniques to automatically extract term translations
from parallel text corpora, which contain aligned bilingual sentence
pairs [10]. Although this method can achieve high translation accuracy, the lack
of large parallel corpora in the medical domain remains
a thorny problem.

Comparable-corpus-based term translation

Less attention has been devoted to extracting term translations from comparable
corpora, which contain texts on similar topics collected independently
in the respective language communities. Fung and Yee [11] used a vector-space model and took a bilingual lexicon (called seed words) as
the feature set to estimate the similarity between a word and its
translation candidates. Chiao and Zweigenbaum [12] adopted a similar method to find French-English translation equivalents
for new medical terms. Comparable corpora are easier to obtain; however,
achieving better performance with higher translation coverage is
still a challenging task.

Web-based term translation

As mentioned above, the conventional methods suffer from
the lack of large parallel corpora and the limited translation
coverage of comparable corpora in the medical domain. Thus, we
apply an integrated Web-based method to deal with medical
term mapping by exploring Web anchor text [4,5] and search-result pages [6]. In the following sections, we introduce these two kinds of Web resources
and describe how to explore them.

METHOD AND MATERIAL

Due to the limit on paper length, we can only briefly describe here our
Web-based term translation method for medical term mapping. For more
details, please refer to our previous work [4,5,6].

Web-based multilingual term translation

Figure 1

1. Procedure

Extracting term translations by mining Web resources requires three major processing
steps.
Both anchor-text mining and search-result mining follow this three-step
procedure.

2. Anchor-text mining

2.1 Anchor text

An anchor text is the descriptive part of an out-link of a Web page, used
to provide a brief description of the linked Web page. Anchor texts in a variety
of languages, from all over the world, might link to the same
page. For a source (unknown) term appearing
in an anchor text of a Web page, it is likely that its corresponding target translations appear
in other anchor texts linking to the same page. Such a bundle of
anchor texts pointing to the same page is called an anchor-text set.

2.2 Procedure
2.3 Probabilistic inference model

Based on a multilingual anchor-text corpus, we may determine the probable
target translations for a source term by using a probabilistic model. This
model assumes that a translation candidate has a higher chance
of being a translation only if it frequently co-occurs with the source
term in the same anchor-text sets. Furthermore, it assumes that the
translation candidates in the anchor texts of pages with higher
authority may be more reliable. Hence, the similarity between a source
English term E and a Chinese translation candidate C was estimated as:
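One form consistent with the definitions that follow and with the anchor-text model of reference 4 is sketched here. It is a reconstruction, not a verbatim copy of the typeset equation: it assumes that E and C occur independently given a page U_i, and the label S_AT is borrowed from the weight name α_AT used for the combined method.

```latex
S_{AT}(E, C) = P(E \leftrightarrow C)
  = \frac{P(E \cap C)}{P(E \cup C)}
  = \frac{\sum_{i=1}^{n} P(E \mid U_i)\, P(C \mid U_i)\, P(U_i)}
         {\sum_{i=1}^{n} \left[ P(E \mid U_i) + P(C \mid U_i)
            - P(E \mid U_i)\, P(C \mid U_i) \right] P(U_i)}
```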
where U_i represents a Web page and P(U_i) is the probability used to estimate the authority of U_i, defined as P(U_i) = L(U_i) / ∑_{j=1..n} L(U_j), where L(U_j) is the number of in-links of page U_j. The values of P(E|U_i) and P(C|U_i) were estimated by calculating the probability of E and C, respectively, appearing in the anchor-text set of U_i. The probabilistic inference model was proposed
to model the authority of pages, which cannot be represented by conventional
methods and yet was shown to be important for increasing the accuracy
of term translation [4].

3. Search-result mining

Even if we can collect large amounts of pages from the Web and build up
a corpus of anchor-text sets, the translation coverage of diverse query
terms is still limited by our collected corpus. To enhance the coverage
rate of term translation in the medical domain, we have exploited search-result
pages.

To explore Web search results, we utilize co-occurrence relations and context
information between a source English term and Chinese translation
candidates to enhance the coverage rate of translation extraction for
unknown terms. We adopted the chi-square test and context-vector analysis,
which could achieve better performance.

3.1 Search-result pages

According to our observations, many Chinese search-result pages from search
engines contain rich snippets of summaries with a mixture of Chinese
and English texts. Therefore, when we search explicitly for English
terms (e.g., “Alzheimer disease”) in Chinese-language
pages from Google, it is likely that the search result will include
relevant snippets containing its Chinese translation (Alzheimer Disease), or even the layperson’s term (dementia for aged people).

3.2 Procedure

3.3 Chi-square test

Based on co-occurrence analysis, the chi-square test [6] (χ²) is adopted to estimate the semantic similarity between the source term E and the target candidate C. The similarity measure is defined as
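Assuming the standard chi-square statistic for a 2×2 contingency table, which matches the page counts a, b, c, d and N defined next, the measure can be written as follows (the label S_χ² is borrowed from the weight name used for the combined method):

```latex
S_{\chi^2}(E, C) = \frac{N \, (a d - b c)^2}{(a + b)\,(a + c)\,(b + d)\,(c + d)}
```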
where a, b, c and d are the numbers of pages retrieved from search engines by submitting
the Boolean queries “E and C”, “E and not C”, “not E and C”, and “not E and not C”, respectively; N is the total number of pages, i.e., N = a + b + c + d.

3.4 Context-vector analysis

Because Chinese pages often contain English text, the source
English term E and the Chinese translation candidate C may share common contextual terms in the search-result pages. The similarity
between E and C is computed based on their context feature vectors in the vector-space
model. The conventional TF-IDF weighting scheme is used and defined
as
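A conventional TF-IDF weight that uses exactly the three quantities defined next is shown here as an assumption; the authors' formula may additionally normalize the term frequency, for example by the largest frequency in the page.

```latex
w_{t_i} = f(t_i, p) \times \log\!\left(\frac{N}{n}\right)
```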
where f(t_i, p) is the frequency of term t_i in search-result page p, N is the total number of Web pages, and n is the number of pages containing t_i. Finally, we use the cosine measure to estimate the similarity as:
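The cosine measure is the dot product of the two weight vectors divided by the product of their Euclidean norms. The Python sketch below illustrates it together with the chi-square score of Section 3.3. It is a minimal illustration rather than the authors' implementation: the function names, the unnormalized tf × log(N/n) weight, and the dictionary-based vectors are all assumptions made for the example.

```python
import math
from collections import Counter


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for a 2x2 table of page counts.

    a: pages matching "E and C"      b: pages matching "E and not C"
    c: pages matching "not E and C"  d: pages matching "not E and not C"
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def tfidf_vector(context_terms, doc_freq, total_pages):
    """Build a TF-IDF context vector (assumed form: tf * log(N / n)).

    context_terms: terms seen around E (or C) in the search-result pages
    doc_freq:      term -> number of pages containing the term (n)
    total_pages:   total number of Web pages (N)
    """
    weights = {}
    for term, freq in Counter(context_terms).items():
        n = doc_freq.get(term, 0)
        if n > 0:  # skip terms with unknown document frequency
            weights[term] = freq * math.log(total_pages / n)
    return weights


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Storing the vectors as dictionaries keeps the sketch short, because context vectors built from search snippets are sparse; a real system would obtain the four page counts from search-engine hit counts for the Boolean queries described in Section 3.3.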
4. Combined method

The anchor-text-based method is effective at extracting translations of high-frequency
Web query terms, while the search-result-based method has
higher coverage of translations for unknown query terms. To combine
the advantages of these two methods, we use a linear combination
of inverse ranks to compute the similarity measure as follows:
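Read literally, a linear combination of inverse ranks over the three measures takes the form below; the name S_all and the index set {AT, χ², CV} are assumptions inferred from the weights listed next.

```latex
S_{all}(E, C) = \sum_{m \in \{AT,\ \chi^2,\ CV\}} \alpha_m \times \frac{1}{R_m(E, C)}
```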
where α_m is an assigned weight for each similarity measure S_m, and R_m(E, C) represents the similarity rank of each target candidate C with respect to its source term E, assigned from 1 to k (the number of candidates) according to the similarity measure S_m(E, C) in decreasing order. The values of the weights α_m were empirically assigned as α_AT = 0.39, α_χ² = 0.28, and α_CV = 0.33 based on our previous experiments [6].

RESULTS

To determine whether the proposed Web-based term translation
method can help knowledge engineers reduce their effort in building the Chinese-English
MeSH by providing correct translation candidates, we first
conducted a preliminary experiment to evaluate the performance of automatically
translating English MeSH terms into Chinese.

We randomly selected two sets of 300 disease terms as test sets from the 9,646 terms
in the Diseases concept of the MeSH tree structure. The average
top-n inclusion rate was adopted as the evaluation metric [4]. For a set of query terms, the top-n inclusion rate is defined as the percentage of source terms whose correct
translations can be found among the first n extracted translations.

Table 1 shows that for test set 1, the overall candidate matching (including
exact and partial matching) achieved 23.6%, 51.66%, and 63.9% for
the top-1, top-5, and top-10 inclusion rates, respectively. Although
the top-1 inclusion rate (i.e., accuracy for automatic
extraction) is low, the top-10 inclusion rate is relatively high. The
inclusion rates in top-5 and top-10 are fairly stable across the two
data sets. Therefore, the proposed method is still effective in providing
knowledge engineers with possibly correct translations when compiling
translations.

Table 2 shows some examples of Chinese translations of English MeSH terms that
were successfully extracted by the proposed method.
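The ranking and its evaluation can be sketched together in Python. The block below is illustrative only and is not the authors' code: the function names, the data layout, and the treatment of a candidate missing from one measure (it simply receives no contribution from that measure) are assumptions. Only the three weights come from the text.

```python
# Weights reported in the text for the anchor-text, chi-square and
# context-vector measures.
WEIGHTS = {"AT": 0.39, "chi2": 0.28, "CV": 0.33}


def combine_by_inverse_rank(scores_by_measure, weights=WEIGHTS):
    """Rank candidates by a weighted sum of inverse ranks.

    scores_by_measure: measure name -> {candidate: similarity score}
    Returns the candidates ordered from best to worst combined score.
    """
    combined = {}
    for measure, scores in scores_by_measure.items():
        # Rank 1 goes to the candidate with the highest similarity.
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, cand in enumerate(ranked, start=1):
            combined[cand] = combined.get(cand, 0.0) + weights[measure] / rank
    return sorted(combined, key=combined.get, reverse=True)


def top_n_inclusion_rate(ranked_candidates, gold, n):
    """Percentage of source terms with a correct translation in the top n.

    ranked_candidates: source term -> ranked list of extracted candidates
    gold:              source term -> set of acceptable translations
    """
    if not ranked_candidates:
        return 0.0
    hits = sum(
        1
        for term, cands in ranked_candidates.items()
        if any(c in gold.get(term, set()) for c in cands[:n])
    )
    return 100.0 * hits / len(ranked_candidates)


if __name__ == "__main__":
    # Toy scores for two hypothetical candidates of one source term.
    toy = {
        "AT": {"c1": 0.9, "c2": 0.1},
        "chi2": {"c1": 2.0, "c2": 5.0},
        "CV": {"c1": 0.2, "c2": 0.7},
    }
    print(combine_by_inverse_rank(toy))
```

Working with ranks rather than raw scores sidesteps the fact that probabilities, chi-square values and cosines live on different scales, which is presumably why the combination is defined over ranks.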
DISCUSSION

Although the performance of exact translation was not satisfactory, for more
than 60% of the terms an exact or partially correct translation appears among the top
ten candidates. This is still very useful in saving labor time in
constructing the Chinese-English MeSH. We developed an efficient interface, called
the Chinese-English MeSH Compilation System (see Figure 2).
We have observed the effectiveness of this system based on a preliminary
experiment. Two part-time knowledge engineers with skills in
medical information systems compiled over 19,000 entries
of the Chinese-English MeSH (about 42,000 entries in total) in three
months. They reported that using this system saved them not only a
lot of time but also mental effort in term mapping.

The major advantage of our Web-based method is that it does not need
any bilingual medical dictionary, whereas Chiao and Zweigenbaum's comparable-corpus-based method [12] used 4,963 seed pairs. Thus, our
method is language-independent and easy to extend to other
language pairs if the source and target languages often appear in
the same text (e.g., Korean-English and Japanese-English [6]). However, the utility of this method might be limited if the two
languages are seldom mixed in text (e.g., French-English). Also, using
the method to translate from English Consumer Health vocabulary may
be valuable for Chinese-speaking consumers.

There are several directions for future improvement. For example, the
difference between the top-10 (63.9%) and top-1 (23.6%) inclusion
rates is around 40 percentage points, showing the magnitude of potential
improvement in the top-1 inclusion rate. We observed that most errors
resulted from Chinese word segmentation, medical term recognition, and
similarity computation for low-frequency terms. Our future work will
focus on these issues in order to improve the inclusion rates.

Currently, we are trying to utilize the Chinese-English MeSH to develop
a prototype of a cross-language medical meta-search engine, MMODE, which
could assist laypersons in retrieving top-quality English medical
information by using Chinese terms (http://mmode.twbbs.org/mmode/).

REFERENCES

1. Hersh WR, Donohoe LC. SAPHIRE International: A Tool for Cross-Language Information Retrieval. Proc AMIA Symp. 1998:673–7.
2. Rosemblat G, Gemoets D, Browne AC, Tse T. Machine translation-supported cross-language information retrieval for a consumer health resource. Proc AMIA Symp. 2003:564–8.
3. Tran TD, Garcelon N, Burgun A, Le Beux P. Experiments in Cross-language Medical Information Retrieval Using a Mixing Translation Module. Medinfo. 2004:946–9.
4. Lu WH, Chien LF, Lee HJ. Translation of Web Queries using Anchor Text Mining. ACM Transactions on Asian Language Information Processing. 2002;1(2):159–72.
5. Lu WH. Term Translation Extraction Using Web Mining Techniques. PhD thesis, Department of Computer Science and Information Engineering, National Chiao Tung University; 2003.
6. Cheng PJ, Teng JW, Chen RC, Wang JH, Lu WH, Chien LF. Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval. Proc 27th ACM SIGIR. 2004:146–53.
7. Joubert M, Fieschi M, Robert JJ, Volot F, Fieschi D. UMLS-based Conceptual Queries to Biomedical Information Databases: An Overview of the Project ARIANE. J Am Med Inform Assoc. 1998 Jan;5(1):52–61.
8. Leroy G, Chen HC. Meeting Medical Terminology Needs-the Ontology-Enhanced Medical Concept Mapper. IEEE Transactions on Information Technology in Biomedicine. 2001;5(4):261–70.
9. Cimino JJ. Vocabulary and health care information technology: state of the art. Journal of the American Society for Information Science. 1995;46:777–82.
10. Gale WA, Church KW. Identifying Word Correspondences in Parallel Texts. Proc DARPA Speech and Natural Language Workshop. 1991.
11. Fung P, Yee LY. An IR Approach for Translating New Words from Nonparallel, Comparable Texts. Proc 36th ACL. 1998:414–20.
12. Chiao YC, Zweigenbaum P. Looking for French-English translations in comparable medical corpora. J Am Med Inform Assoc. 2002;8(suppl):150–4.