Journal URL: http://www.amia.org/meetings/archives.asp

AMIA Annu Symp Proc. 2005; 2005: 475–479.
PMCID: PMC1560756
Semi-Automatic Construction of the Chinese-English MeSH Using Web-Based Term Translation Method
Wen-Hsiang Lu, PhD,1 Shih-Jui Lin, MS,2 Yi-Che Chan,1 and Kuan-Hsi Chen1
1 Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, ROC
2 Institute of Information Science, Academia Sinica, Taiwan, ROC
Because of the language barrier, non-English-speaking users are unable to retrieve the most up-to-date medical information from authoritative U.S. medical websites such as PubMed and MedlinePlus. A few cross-language medical information retrieval (CLMIR) systems have used MeSH (Medical Subject Headings) together with a multilingual thesaurus to bridge the gap. Unfortunately, MeSH has not yet been translated into traditional Chinese.
We propose a semi-automatic approach to constructing a Chinese-English MeSH based on Web-based term translation. The system provides knowledge engineers with candidate terms mined from anchor texts and search-result pages. The results are encouraging: more than 19,000 Chinese-English MeSH entries have been compiled so far. This thesaurus will be used for Chinese-English CLMIR in the future.
A number of Web resources, such as PubMed and MedlinePlus, provide the public and healthcare professionals with the most up-to-date findings in medicine. Although access to these top-quality resources is free and unlimited for users all around the world, most of the information is available in English only. Non-English-speaking users therefore often encounter a substantial language barrier when trying to access medical information from these websites. In addition, most non-English-speaking consumers are not familiar with medical terminology even in their first language, which raises the barrier to medical information retrieval even higher. For example, most Chinese people know the Chinese layperson’s term [Chinese term; inline image amia2005_0475f3.jpg] (dementia for aged people) but not the medical term [Chinese term; inline image amia2005_0475f4.jpg] (Alzheimer Disease). Currently, it is almost impossible for this population to retrieve the consumer health information they need from MedlinePlus. Thus, matching Chinese medical terms, especially laypersons’ terms, to English medical terms is a critical challenge in helping non-English-speaking users find useful medical information. Unfortunately, no system currently provides Chinese-English cross-language medical information retrieval (CLMIR).
Multilingual medical thesauri play a crucial role in CLMIR, according to experience with CliniWeb [1] and other CLMIR systems [2,3]. However, manual lexicography is time-consuming and not cost-effective. To date, there is no effective method for constructing multilingual medical thesauri automatically, and most existing medical thesauri are built manually.
We propose a new method that semi-automatically maps Chinese medical terms to Medical Subject Headings (MeSH) and constructs a bilingual medical thesaurus for Chinese-English CLMIR. MeSH is the most significant medical thesaurus in English and has been manually translated into many languages. However, a traditional Chinese version of MeSH is not yet available.
In this study, we constructed part of a traditional Chinese-English MeSH by translating English medical terms in MeSH into Chinese with an integrated Web-based term translation method. In previous work, we proposed this integrated method, which explores two kinds of Web resources, namely Web anchor texts [4,5] and search-result pages [6], to deal effectively with the multilingual translation of diverse unknown (new) Web query terms.
The present study has two major goals. First, we expect the proposed semi-automatic method to help knowledge engineers reduce the manual effort involved in the difficult task of compiling a Chinese-English MeSH. Second, we will use the Chinese-English MeSH to develop a practical cross-language medical meta-search engine that helps laypersons retrieve top-quality English medical information by submitting Chinese terms.
We first review previous work on automatic monolingual term mapping and cross-language term translation.
Monolingual term mapping
In monolingual medical information retrieval, laypersons often find that their search terms are not compatible with the professional terms used in medical documents. A number of studies have addressed this problem [7,8,9]. Leroy and Chen [8] developed the Medical Concept Mapper to help users find medical information by providing them with appropriate medical search terms. However, the problem of cross-language term mapping has so far received little attention in the medical domain.
Parallel-corpus-based term translation
In machine translation research, a number of studies have used statistical techniques to automatically extract term translations from parallel text corpora, which contain aligned bilingual sentence pairs [10]. Although this approach can achieve high translation accuracy, the lack of large parallel corpora in the medical domain remains a thorny problem.
Comparable-corpus-based term translation
Less attention has been devoted to extracting term translations from comparable corpora, which contain texts on similar topics collected independently in the respective language communities. Fung and Yee [11] used a vector-space model and took a bilingual lexicon (called seed words) as the feature set to estimate the similarity between a word and its translation candidates. Chiao and Zweigenbaum [12] adopted a similar method to find French-English translation equivalents for new medical terms. Comparable corpora are easier to obtain; however, achieving higher translation coverage with good performance is still a challenging task.
Web-based term translation
As mentioned above, conventional methods suffer from the lack of large parallel corpora and from the limited translation coverage of comparable corpora in the medical domain. We therefore apply an integrated Web-based method to medical term mapping by exploring Web anchor texts [4,5] and search-result pages [6]. The following sections introduce these two kinds of Web resources and describe how we explore them.
Because of the paper length limit, we describe our Web-based term translation method for medical term mapping only briefly here. For more details, please refer to our previous work [4,5,6].
Web-based multilingual term translation
Figure 1 shows the architecture of the integrated Web-based method, which mines anchor texts and search-result pages to compile the Chinese-English MeSH.
Figure 1. The architecture of Web-based term translation for the compilation of Chinese-English MeSH.
1. Procedure
To extract term translation through mining Web resources, three major processing steps are required:
  • Corpus collection: Collect comparable/mixed texts from the Web as a bilingual/multilingual corpus.
  • Translation candidate extraction: Extract translation candidates from the collected corpus.
  • Translation selection: Estimate the similarity for each translation candidate and determine the most probable translations.
Both anchor-text mining and search-result mining follow the three-step procedure.
2. Anchor-text mining
2.1 Anchor text
An anchor text is the descriptive part of an out-link of a Web page, used to provide a brief description of the linked Web page. Anchor texts in multiple languages, from pages all over the world, may link to the same page. For a source (unknown) term appearing in an anchor text of a Web page, it is likely that its corresponding target translations appear in other anchor texts linking to the same page. Such a bundle of anchor texts pointing to the same page is called an anchor-text set.
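The notion of an anchor-text set can be illustrated with a short Python sketch. It is not the authors' crawler: the `(target URL, anchor text)` input format, the helper names, the example URLs, and the Chinese example string are all assumptions made for illustration. The sketch groups anchor texts by the page they point to and keeps only the sets that mix Chinese and English text.

```python
import re
from collections import defaultdict

CJK = re.compile(r"[\u4e00-\u9fff]")  # common Chinese characters
LATIN = re.compile(r"[A-Za-z]")


def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to."""
    sets = defaultdict(set)
    for target_url, anchor_text in links:
        sets[target_url].add(anchor_text.strip())
    return dict(sets)


def is_bilingual(anchor_texts):
    """True if the anchor-text set contains both Chinese and English text."""
    joined = " ".join(anchor_texts)
    return bool(CJK.search(joined)) and bool(LATIN.search(joined))


# Hypothetical (target URL, anchor text) pairs harvested from crawled pages.
links = [
    ("http://example.org/ad", "Alzheimer Disease"),
    ("http://example.org/ad", "阿茲海默症"),
    ("http://example.org/ds", "Down Syndrome"),
]
anchor_sets = build_anchor_text_sets(links)
bilingual_urls = sorted(u for u, texts in anchor_sets.items() if is_bilingual(texts))
print(bilingual_urls)
```

Only the first URL survives the filter, because its anchor-text set is the only one containing both languages.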
2.2 Procedure
  • Corpus collection: To make good use of Web anchor texts, we collected 1,980,816 traditional Chinese Web pages in Taiwan and extracted from them 109,416 pages (URLs) whose anchor-text sets contained both traditional Chinese and English terms. These pages form the anchor-text-set corpus used for extracting Chinese-English translations of medical terms.
  • Translation candidate extraction: Three keyword extraction methods were used to extract Chinese key terms from the anchor-text corpus: the PAT-tree-based, query-log-based, and tagger-based methods [4]. After key term extraction, we select the k (k = 50) most frequent terms as translation candidates.
  • Translation selection: Estimate the similarity between the source term and each candidate using the probabilistic inference model described next.
2.3 Probabilistic inference model
Based on a multilingual anchor-text corpus, we can determine the probable target translations for a source term by using a probabilistic model. This model assumes that a translation candidate has a higher chance of being a correct translation if it frequently co-occurs with the source term in the same anchor-text sets. Furthermore, it assumes that translation candidates in the anchor texts of pages with higher authority are more reliable. Hence, the similarity between a source English term E and a Chinese translation candidate C is estimated as:
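$$S_{AT}(E, C) = \frac{\sum_{i=1}^{n} P(E \mid U_i)\, P(C \mid U_i)\, P(U_i)}{\sum_{i=1}^{n} \left[ P(E \mid U_i) + P(C \mid U_i) - P(E \mid U_i)\, P(C \mid U_i) \right] P(U_i)} \tag{1}$$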
where $U_i$ represents a Web page and $P(U_i)$ is the probability used to estimate the authority of $U_i$, defined as $P(U_i) = L(U_i) / \sum_{j=1}^{n} L(U_j)$, where $L(U_j)$ is the number of in-links of page $U_j$. The values of $P(E \mid U_i)$ and $P(C \mid U_i)$ are estimated as the probabilities that E and C, respectively, appear in the anchor-text set of $U_i$. The probabilistic inference model was proposed to model the authority of pages, which conventional methods cannot represent and which has been shown to be important for increasing the accuracy of term translation [4].
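A minimal Python sketch of how such an authority-weighted similarity might be computed follows. It assumes the similarity is the ratio of the authority-weighted probability that both terms occur to the probability that either occurs, with E and C treated as independent given a page; that form follows the cited anchor-text work [4] and should be checked against it. The estimator `term_prob` (the fraction of anchor texts containing the term), the function names, and the toy data are illustrative assumptions, not the authors' implementation.

```python
def page_authority(in_links):
    """P(U_i) = L(U_i) / sum_j L(U_j), from in-link counts per page."""
    total = sum(in_links.values())
    return {url: count / total for url, count in in_links.items()}


def term_prob(term, anchor_texts):
    """Assumed estimator of P(term | U_i): share of anchor texts containing the term."""
    if not anchor_texts:
        return 0.0
    return sum(term in text for text in anchor_texts) / len(anchor_texts)


def anchor_text_similarity(e, c, anchor_sets, in_links):
    """Authority-weighted co-occurrence similarity between terms e and c."""
    p_u = page_authority(in_links)
    numerator = denominator = 0.0
    for url, texts in anchor_sets.items():
        pe = term_prob(e, texts)
        pc = term_prob(c, texts)
        numerator += pe * pc * p_u[url]                # both terms occur
        denominator += (pe + pc - pe * pc) * p_u[url]  # either term occurs
    return numerator / denominator if denominator else 0.0


# Hypothetical anchor-text sets and in-link counts.
anchor_sets = {
    "u1": ["Alzheimer Disease", "阿茲海默症"],
    "u2": ["Alzheimer Disease info"],
}
in_links = {"u1": 3, "u2": 1}
score = anchor_text_similarity("Alzheimer Disease", "阿茲海默症", anchor_sets, in_links)
print(round(score, 4))
```

In this toy corpus the page with more in-links carries three quarters of the weight, so co-occurrence on that page dominates the score.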
3. Search-result mining
Even if we collect a large number of pages from the Web and build a corpus of anchor-text sets, the translation coverage for diverse query terms is still limited by the collected corpus. To enhance the coverage of term translation in the medical domain, we also exploit search-result pages.
In exploring Web search results, we use co-occurrence relations and context information between a source English term and its Chinese translation candidates to improve the coverage of translation extraction for unknown terms. For this purpose we adopted the chi-square test and context-vector analysis, which could achieve better performance.
3.1 Search-result pages
According to our observations, many Chinese search-result pages from search engines contain rich snippets of summaries with a mixture of Chinese and English text. Therefore, when we search explicitly for English terms (e.g., “Alzheimer disease”) in Chinese-language pages from Google, it is likely that the search results will include relevant snippets containing the Chinese translation [Chinese term; inline image amia2005_0475f5.jpg] (Alzheimer Disease), or even the layperson’s term [Chinese term; inline images amia2005_0475f6.jpg and amia2005_0475f7.jpg] (dementia for aged people).
3.2 Procedure
  • Corpus collection: To obtain the search-result pages for source English medical terms, we submit the terms to search engines (e.g., Google). We collect the page counts of term occurrences and only the first 100 retrieved snippets, from which contextual terms are extracted as feature vectors for computing the similarity between target translation candidates and source terms.
  • Translation candidate extraction: Chinese translation candidates are extracted from the search-result pages with the same methods used in anchor-text mining, except that the number of candidates is set to k = 20 to reduce the computational load.
  • Translation selection: Estimate the similarity between the source term and each candidate using the chi-square test and context-vector analysis described next.
3.3 Chi-square test
Based on co-occurrence analysis, the chi-square test (χ²) [6] is adopted to estimate the semantic similarity between the source term E and the target candidate C. The similarity measure is defined as
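$$S_{\chi^2}(E, C) = \frac{N \times (a \times d - b \times c)^2}{(a + b) \times (a + c) \times (b + d) \times (c + d)} \tag{2}$$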
where a, b, c, and d are the numbers of pages retrieved from search engines by submitting the Boolean queries “E and C”, “E and not C”, “not E and C”, and “not E and not C”, respectively, and N is the total number of pages, i.e., N = a + b + c + d.
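The chi-square score for a 2×2 contingency table of page counts can be sketched in Python as below. The standard 2×2 chi-square statistic is assumed, the function name is illustrative, and the page counts are invented; in practice they would come from the four Boolean queries described above.

```python
def chi_square_similarity(a, b, c, d):
    """Chi-square statistic for a 2x2 table of page counts.

    a: pages matching "E and C"        b: pages matching "E and not C"
    c: pages matching "not E and C"    d: pages matching "not E and not C"
    """
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical page counts for one (E, C) pair.
print(chi_square_similarity(80, 20, 10, 890))
```

A score of zero means the two terms co-occur no more often than chance would predict; larger values indicate stronger association.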
3.4 Context-vector analysis
Because Chinese pages often contain English text, the source English term E and the Chinese translation candidate C may share common contextual terms in the search-result pages. The similarity between E and C is therefore computed from their context feature vectors in the vector-space model. The conventional TF-IDF weighting scheme is used, defined as
$$w_{t_i} = \frac{f(t_i, p)}{\max_j f(t_j, p)} \times \log\left(\frac{N}{n}\right) \tag{3}$$
where f(ti, p) is the frequency of term ti in search-result page p, N is the total number of Web pages, and n is the number of the pages containing ti. Finally, we use the cosine measure to estimate the similarity as:
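$$S_{CV}(E, C) = \frac{\sum_{i=1}^{m} w_{E_i} \times w_{C_i}}{\sqrt{\sum_{i=1}^{m} w_{E_i}^2} \times \sqrt{\sum_{i=1}^{m} w_{C_i}^2}} \tag{4}$$
where $w_{E_i}$ and $w_{C_i}$ are the weights of the i-th context term in the feature vectors of E and C, and m is the number of context terms.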
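Context-vector similarity can be sketched in Python as follows. The sketch assumes a TF-IDF weight in which the term frequency is normalized by the largest frequency on the page before being multiplied by log(N/n); that normalization, the function names, and all counts are assumptions made for illustration rather than details confirmed by this paper.

```python
import math


def tfidf_vector(term_counts, doc_freq, total_pages):
    """Weight each context term by (f / max f) * log(N / n)."""
    max_f = max(term_counts.values())
    return {
        term: (f / max_f) * math.log(total_pages / doc_freq[term])
        for term, f in term_counts.items()
    }


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical context-term counts from the snippets of E and C.
doc_freq = {"brain": 10, "memory": 100, "elderly": 50}
vec_e = tfidf_vector({"brain": 2, "memory": 1}, doc_freq, total_pages=1000)
vec_c = tfidf_vector({"brain": 1, "elderly": 1}, doc_freq, total_pages=1000)
print(cosine(vec_e, vec_c))
```

The two vectors overlap only on the shared context term, so the cosine lies strictly between 0 and 1.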
4. Combined method
The anchor-text-based method is effective at extracting translations of high-frequency Web query terms, while the search-result-based method has higher translation coverage for unknown query terms. To combine the advantages of the two methods, we compute the similarity measure as a linear combination of inverse ranks:
$$S_{All}(E, C) = \sum_{m} \frac{\alpha_m}{R_m(E, C)} \tag{5}$$
where $\alpha_m$ is the weight assigned to each similarity measure $S_m$, and $R_m(E, C)$ is the rank of target candidate C with respect to its source term E, running from 1 to k (the number of candidates) in decreasing order of $S_m(E, C)$. The weights were assigned empirically as $\alpha_{AT} = 0.39$, $\alpha_{\chi^2} = 0.28$, and $\alpha_{CV} = 0.33$ based on our previous experiments [6].
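The rank-based combination can be illustrated with a short Python sketch. The weights are the ones reported in the text; the measure keys (`"AT"`, `"chi2"`, `"CV"`), the candidate names, and the scores are hypothetical, and the tie-handling (ties keep dictionary order) is an assumption.

```python
WEIGHTS = {"AT": 0.39, "chi2": 0.28, "CV": 0.33}  # weights reported in the text


def ranks(scores):
    """Map each candidate to its rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cand: position + 1 for position, cand in enumerate(ordered)}


def combined_scores(scores_by_measure, weights=WEIGHTS):
    """Sum of weight / rank over all measures that ranked the candidate."""
    rank_tables = {m: ranks(s) for m, s in scores_by_measure.items()}
    combined = {}
    for measure, table in rank_tables.items():
        for cand, rank in table.items():
            combined[cand] = combined.get(cand, 0.0) + weights[measure] / rank
    return combined


# Hypothetical similarity scores for three candidates under each measure.
scores_by_measure = {
    "AT": {"c1": 0.9, "c2": 0.5, "c3": 0.1},
    "chi2": {"c1": 10.0, "c2": 30.0, "c3": 5.0},
    "CV": {"c1": 0.6, "c2": 0.2, "c3": 0.7},
}
combined = combined_scores(scores_by_measure)
best = max(combined, key=combined.get)
print(best, round(combined[best], 3))
```

Using ranks rather than raw scores sidesteps the fact that the three measures live on very different scales.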
To determine whether the proposed Web-based term translation method can help knowledge engineers reduce the effort of building the Chinese-English MeSH by providing correct translation candidates, we first conducted a preliminary experiment that evaluated its performance in automatically translating English MeSH terms into Chinese.
We randomly selected two sets of 300 disease terms as test sets from the 9,646 terms under the Diseases concept of the MeSH tree structure. The average top-n inclusion rate was adopted as the evaluation metric [4]. For a set of query terms, the top-n inclusion rate is defined as the percentage of source terms whose correct translations can be found among the first n extracted translations.
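The top-n inclusion rate defined above can be computed with a few lines of Python. This is a sketch of the metric only: the function name, the data layout (ranked candidate lists and sets of accepted translations per source term), and the placeholder strings are assumptions, not the evaluation data of this study.

```python
def top_n_inclusion_rate(ranked_candidates, gold, n):
    """Fraction of source terms with a correct translation in the top n candidates."""
    hits = sum(
        1
        for term, candidates in ranked_candidates.items()
        if any(cand in gold.get(term, set()) for cand in candidates[:n])
    )
    return hits / len(ranked_candidates)


# Hypothetical ranked outputs and accepted translations for three source terms.
ranked_candidates = {
    "term A": ["zh-a1", "zh-a2", "zh-a3"],
    "term B": ["zh-b1", "zh-b2", "zh-b3"],
    "term C": ["zh-c1", "zh-c2", "zh-c3"],
}
gold = {"term A": {"zh-a1"}, "term B": {"zh-b2"}, "term C": {"zh-x"}}

for n in (1, 2, 3):
    print(n, top_n_inclusion_rate(ranked_candidates, gold, n))
```

The rate can only grow with n, which is why a low top-1 rate and a much higher top-10 rate can coexist.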
Table 1 shows that for test set 1, overall candidate matching (including exact and partial matching) achieved top-1, top-5, and top-10 inclusion rates of 23.6%, 51.66%, and 63.9%, respectively. Although the top-1 inclusion rate (i.e., the accuracy of fully automatic extraction) is low, the top-10 inclusion rate is relatively high, and the top-5 and top-10 inclusion rates are fairly stable across the two test sets. The proposed method is therefore effective in providing knowledge engineers with likely correct candidates when compiling translations. Table 2 shows some examples of Chinese translations of English MeSH terms that were successfully extracted by the proposed method.
Table 1. Inclusion rates of Chinese translation for two test sets of 300 MeSH disease terms
Table 2. Some examples of correct Chinese translations extracted for English MeSH terms
Although the performance of exact translation was not satisfactory, correct or partially correct translations appeared among the top ten candidates for more than 60% of the terms. This is still very useful in saving labor time in constructing the Chinese-English MeSH. We developed an efficient interface, called the Chinese-English MeSH Compilation System. Figure 2 shows the system, which consists of three major parts. Part 1 displays the English MeSH term and its Chinese translation after compilation. Part 2 lets knowledge engineers compile translations efficiently with checkboxes and a text input field. Part 3 adds some auxiliary resources to compensate for missing translations. The interface suggests about 30 translation candidates, which increases the chance of covering more laypersons’ terms. For instance, “Down Syndrome” has the Chinese translations [Chinese term; inline image amia2005_0475f8.jpg] or [Chinese term; inline images amia2005_0475f9.jpg and amia2005_0475f10.jpg] and is popularly called [Chinese term; inline image amia2005_0475f11.jpg] (see Figure 2).
Figure 2. The Chinese-English MeSH Compilation System.
We observed the effectiveness of this system in a preliminary experiment. Two part-time knowledge engineers skilled in medical information systems compiled over 19,000 entries of the Chinese-English MeSH (about 42,000 entries in total) in three months. They reported that the system saved them not only a lot of time but also mental effort in term mapping.
The major advantage of our Web-based method is that it requires no bilingual medical dictionary, whereas the comparable-corpus-based method of Chiao and Zweigenbaum [12] used 4,963 seed pairs. Our method is thus language-independent and easy to extend to other language pairs in which the source and target languages often appear in the same text (e.g., Korean-English and Japanese-English [6]). However, its usefulness may be limited when the two languages are seldom mixed in the same text (e.g., French-English). Applying the method to translate an English consumer health vocabulary may also be valuable for Chinese-speaking consumers.
There are several directions for future improvement. For example, the gap between the top-10 (63.9%) and top-1 (23.6%) inclusion rates is around 40 percentage points, which indicates how much the top-1 inclusion rate could potentially improve. We observed that most errors resulted from Chinese word segmentation, medical term recognition, and similarity computation for low-frequency terms. Our future work will focus on these issues in order to improve the inclusion rates.
We are currently using the Chinese-English MeSH to develop a prototype cross-language medical meta-search engine, MMODE, which could help laypersons retrieve top-quality English medical information by using Chinese terms (http://mmode.twbbs.org/mmode/).
1. Hersh WR, Donohoe LC. SAPHIRE International: A Tool for Cross-Language Information Retrieval. Proc AMIA Symp. 1998:673–7.
2. Rosemblat G, Gemoets D, Browne AC, Tse T. Machine translation-supported cross-language information retrieval for a consumer health resource. Proc AMIA Symp. 2003:564–8.
3. Tran TD, Garcelon N, Burgun A, Le Beux P. Experiments in Cross-language Medical Information Retrieval Using a Mixing Translation Module. Medinfo. 2004:946–9.
4. Lu WH, Chien LF, Lee HJ. Translation of Web Queries using Anchor Text Mining. ACM Transactions on Asian Language Information Processing. 2002;1(2):159–72.
5. Lu WH. Term Translation Extraction Using Web Mining Techniques. PhD thesis, Department of Computer Science and Information Engineering, National Chiao Tung University. 2003.
6. Cheng PJ, Teng JW, Chen RC, Wang JH, Lu WH, Chien LF. Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval. Proc 27th ACM SIGIR. 2004:146–53.
7. Joubert M, Fieschi M, Robert JJ, Volot F, Fieschi D. UMLS-based Conceptual Queries to Biomedical Information Databases: An Overview of the Project ARIANE. J Am Med Inform Assoc. 1998 Jan;5(1):52–61.
8. Leroy G, Chen HC. Meeting Medical Terminology Needs-the Ontology-Enhanced Medical Concept Mapper. IEEE Transactions on Information Technology in Biomedicine. 2001;5(4):261–70.
9. Cimino JJ. Vocabulary and health care information technology: state of the art. Journal of the American Society for Information Science. 1995;46:777–82.
10. Gale WA, Church KW. Identifying Word Correspondences in Parallel Texts. Proc DARPA Speech and Natural Language Workshop. 1991.
11. Fung P, Yee LY. An IR Approach for Translating New Words from Nonparallel, Comparable Texts. Proc 36th ACL. 1998:414–20.
12. Chiao YC, Zweigenbaum P. Looking for French-English translations in comparable medical corpora. J Am Med Inform Assoc. 2002;8(suppl):150–4.
