NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Balk EM, Chung M, Hadar N, et al. Accuracy of Data Extraction of Non-English Language Trials with Google Translate [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2012 Apr.



Study Selection

We included articles in the following nine languages: Chinese, French, German, Hebrew, Italian, Japanese, Korean, Portuguese, and Spanish. We had also planned to include Russian but were unable to locate a source of Russian-language articles as PDF or HTML files. The languages were chosen based on a combination of their frequency among articles in PubMed (Table 1) and the availability of past and present Tufts EPC research associates and physician-investigators who are native or fluent speakers of these languages and who have expertise in systematic review and data extraction.

Using QUOSA Information Manager™ (v 8.07.265, QUOSA, Inc.) software, which allowed us to search PubMed and automatically retrieve available PDF files, we searched with the term “randomized controlled trial,” restricted separately to each of the 10 languages (initially including Russian). This tool can retrieve PDF files from all journals for which the Tufts Health Services Library has a subscription or that are publicly available. We accepted the first 10 publications in each language, regardless of topic, for which either a machine-readable PDF or HTML file was available for the full text of the article. We accepted only studies with these file types because other formats could not be translated with Google Translate. Full-text articles were screened for eligibility by the researcher who was a native speaker of that language. Eligible studies were randomized controlled trials (RCTs) that reported per-treatment group results data (with the exception of Hebrew language studies, see below). We excluded publications that had a simultaneous English translation in the PDF or HTML file. We also excluded publications that were not primary reports of RCTs (but were summaries of English-language RCTs). When necessary, we retrieved additional articles from QUOSA to obtain 10 eligible studies per language. When we were unable to find sufficient available trials in a language, the researcher who was a native speaker of that language searched country- or journal-specific online databases for relevant studies (e.g., the Korean medical literature database or the Israeli journal Harefuah). Upon review of the Hebrew language literature, we found no RCTs in a suitable file format. Therefore, for Hebrew, we included any study that had any comparison between two groups of study participants (whether by an intervention or by a participant characteristic such as age).

In addition, we chose 10 English-language RCTs to use as a reference standard. These were RCTs that were previously extracted by one of the team members for another systematic review project that included both a continuous and a categorical outcome.


Each article was translated into English using Google Translate, using the simplest method possible for each PDF (or HTML) file. Depending on the format of the articles, the English translations included the original tables and figures, translated as well as the tool allowed. We did not copy over any English language abstracts that were published with the original articles, but we did copy over English language tables and figures. Each article was translated into a separate Word, PDF, or HTML file that could be accessed without seeing the original article. Translations were performed by the project lead and the research assistant. Where feasible, we translated articles from languages we could not read. A rough estimate of the time required to extract each study was tracked.

Basic Instructions Compiled for Article Translation

The following are the basic instructions we compiled for internal use to perform article translation. They assume the use of a Microsoft Windows™ operating system. They are not meant to be comprehensive instructions.

  1. If you are working from a PDF or HTML file
    1. Under the large text box, in light blue, click “translate a document”
    2. Browse to the relevant PDF/HTM.
    3. Pick the From language.
    4. Click Translate
  2. Save the translation as an HTML file.
  3. Google translate seems to maintain the formatting, particularly of tables, much better when it’s working off a Web site (HTM/HTML file) than a PDF document.
    1. If sections (particularly tables or figures) are not clear, go to the original file and follow the directions in steps 4 and 5 (for those sections or the whole article)
  4. If the automatic translation fails
    1. Copy text (paragraph by paragraph, column by column, or page by page, whichever works cleanly) into a Word document
      1. Care needs to be taken in some languages (e.g., Hebrew) where the direction of text may be different than English
    2. Clean up the Word file as necessary (e.g., remove inappropriate line breaks within sentences—particularly for Asian languages, remove hyphens if necessary)
    3. Copy sections or the whole cleaned up text into the large text box in Google Translate.
    4. Copy the translated text back into a Word document and save.
    5. For tables and figures with translatable text (text that can be copied), enter the translations into the appropriate cells in a newly created shell of the table or otherwise indicate which original language text aligns with which translation.
  5. If an article consists of blocks of text images (as from scanned documents) in which a machine cannot read lines of text, transform these images into text by applying an optical character recognition (OCR) process to the file. Then attempt to translate following step 4.
    1. This approach is likely to work only for languages with Latin alphabets
  6. If all translations (or all attempts to copy) from a language fail, particularly for languages with non-Latin alphabets, you may need to “Install files for complex script and right-to-left languages” or make other modifications to your PC under Regional and Language Options/Language in the Control Panel.
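
The manual cleanup in step 4 (removing line breaks within sentences and stray hyphens before pasting text into Google Translate) can be partly automated. The following is a minimal Python sketch, not the procedure the team used; the function name `clean_copied_text` and the `join_with` parameter are hypothetical, and the choice to join lines without a space for some languages is an assumption.

```python
import re


def clean_copied_text(raw: str, join_with: str = " ") -> str:
    """Tidy text copied out of a PDF before pasting it into a translator.

    join_with is the string used to join lines inside a paragraph. Passing ""
    is an assumption suited to scripts written without spaces between words.
    """
    # Normalise Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Re-join words hyphenated across a line break ("rando-\nmized" -> "randomized").
    # Caution: this also joins genuine hyphenated compounds split at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines mark paragraph boundaries; keep those, drop breaks inside them.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        if lines:
            cleaned.append(join_with.join(lines))
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The patients were rando-\nmized into two\ngroups.\n\nResults: none."
    print(clean_copied_text(sample))
```

The output still needs a visual check against the original, especially for right-to-left text and for tables, which this sketch does not handle.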

Data Extraction

Data from the original language versions of the articles were extracted by the native speakers. The native speakers included two current physician-investigator members of the EPC (French [ID]*, German [KU]), four physician-investigators formerly associated with the EPC (French [GK], Italian and Spanish [JC], Japanese [TT], Portuguese [LZ]), three current EPC research associates (Chinese [MC, WY], Hebrew [NH]), and one former EPC research associate (Korean [JL]). Whenever an article included an English version of the abstract in the original version, extractors of the original language version were instructed to ignore the English version of the abstract.

The English translated versions of the articles were extracted by one of five researchers who did not speak the given language (one physician-investigator [EB] and four research associates [MC, NH, KP, WY]), all currently within the EPC. The extractors of the English language versions were distributed across languages to avoid pairing of original and English language data extractors. Original and English language data extractors were not allowed to review each other’s extractions.

With this design, any lack of agreement between the original and English-translated versions can be attributed to either errors in translation or differences between pairs of extractors. To obtain some information on between-extractor variability, the five within-EPC extractors [EB, MC, NH, KP, WY] double-extracted 10 English-language RCTs. Specifically, each extractor extracted two English language articles they had previously extracted for a prior systematic review and two other English language articles they had never seen before (5 extractors × 4 articles = 20 extractions, so each of the 10 articles was extracted twice).

Data Extraction Form

Since we were primarily interested in the accuracy of the data extraction, as opposed to the accuracy of all the text, we performed limited data extraction on those study features that are most important for assessing the study characteristics, methods, and results (see Appendix A for the data extraction form). We limited study quality-related features to objective measures to minimize subjective evaluation of the studies by the data extractors. We extracted the following information: the eligibility criteria, descriptions of the interventions and control, sample size, duration of followup, descriptions and definitions of selected outcomes, the reporting of randomization and allocation concealment techniques, use of blinding, use of intention-to-treat analyses, the reporting of power calculation, and results for two selected outcomes, including baseline value, followup value, mean change, relative effects, confidence intervals, and P values. The selection of outcomes for results data extraction was based on type of data (categorical or continuous), the location of reporting (abstract or full text only), and the completeness of reporting (e.g., mean with standard deviation, per-treatment group data, pre- and post-treatment data). Whenever possible, we selected one categorical outcome and one continuous outcome from each trial, and one outcome that was presented in the abstract (and the full text) and one presented in full text only. We focused on outcomes for which there were direct comparisons between interventions; this approach emphasized effect outcomes. We thus mostly excluded adverse events, except where they were reported for all interventions.

The English language extractor was also asked how much additional time was needed to extract translated articles (compared to what it might have taken to extract an equivalent English language article) as “none,” “a little” (up to about a half-hour), or “a lot.”

Data Extraction Comparison

For each study, a single researcher [EMB], with the assistance of a research assistant, compared pairs of data extraction forms. The research assistant compared the straightforward pieces of data. The project lead confirmed these and compared extractions of the more clinically or methodologically difficult data (e.g., eligibility criteria, P values). The original plan was to compare each item in the data extraction form and then, for each study, to ask each data extractor to confirm, from their version of the article, any piece of data for which there was a discrepancy. The pairs of data extractors would then meet to review remaining discrepancies and to come to agreement on whether each discrepancy was due to language differences or other reasons. However, four modifications had to be made.

First, the data items from the extraction form were consolidated for the purposes of data comparison (see the annotations in Appendix A). For example, the various types of eligibility criteria asked for were condensed into simply “inclusion criteria” and “exclusion criteria.” Other data items were not analyzed because of lack of relevance or because of wide-ranging disparities in interpretation by the data extractors (e.g., washout period, other blinding methods).

Second, regardless of how many items were extracted, we analyzed (compared) only one intervention, one comparator, the listing of up to five outcomes, the results for one categorical outcome, and one continuous outcome. We chose the first outcomes listed by the original language extractor.

Third, the data reconciliation between data extractors was reduced to simply asking the English-language data extractors (who are all active members of the EPC) to add or confirm data that were missing (compared with the original language extraction) or that, in the judgment of the project lead, required some clarification to assess whether the translation was adequate. In rare instances, the original language extractors (who were mostly off-site) were also asked to fill in missing data; however, in most instances of data missing from the original language extraction, the data item was excluded from the comparison. Exceptions were made when, in the judgment of the project lead, the missing data meant “no data” or the English language extraction was sufficiently clear and coherent to be assumed to be accurate. This modification was made both because the volume of data mismatches was so large as to make this step highly time-consuming, and because most of the non-English extractors were off-site (with a time difference of up to 13 hours) and their availability became limited.

The fourth modification further allowed the researcher doing the data comparison to use his judgment to assess the data extraction forms in toto to determine whether there was agreement. Examples included treating negative inclusion criteria (e.g., not male) as equivalent to exclusion criteria (female), treating “no” and “no data” as equivalent, determining whether a swapped treatment and comparator were due to arbitrary selection by the extractor or to poor translation, and determining whether P values alternatively extracted as within-group or between-group differences were the same. Because of the judgments involved in much of the data comparison, a single researcher (the project lead) made the final comparisons for all studies to maintain consistency across studies.


We calculated the simple percent agreement (items in agreement/total items) as the outcome metric for the analyses. We analyzed percent agreement within sets of studies in each language for each item and for groups of items based on the “tables” on the data extraction form (see Appendix A): eligibility criteria (extraction form table 1; 2 items); intervention and comparator combined (extraction form tables 2a, 2b, 3, 3a, and 3b; 12 items); design (extraction form table 5; 4 items); quality issues (extraction form table 6; 9 items); outcomes (extraction form table 7; 7 items); categorical results (extraction form table 8; 9 items); and continuous results (extraction form table 9; 27 items). Histograms of the percent agreement for all items together and for each category group within each language (including English) were graphed so that comparisons could be made across languages. The English language study comparisons acted as a reference standard to compare the degree of agreement we achieved by extracting data from English language articles with the degree of agreement for each language. We did not use kappa statistics because the large majority of items were not dichotomous. In general, we were comparing descriptions (e.g., “inclusion criteria”).
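
To make the metric concrete, the sketch below computes simple percent agreement overall and by item group. It is an illustrative Python example under stated assumptions, not the analysis code used in the study (which was run in Stata); the function names, the `(group, agrees)` record layout, and the sample data are all hypothetical.

```python
from collections import defaultdict


def percent_agreement(judgments):
    """Percent of compared items judged to agree (True) out of all compared items."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("no items to compare")
    return 100.0 * sum(judgments) / len(judgments)


def agreement_by_group(records):
    """records: iterable of (group, agrees) pairs. Returns {group: percent agreement}."""
    grouped = defaultdict(list)
    for group, agrees in records:
        grouped[group].append(agrees)
    return {group: percent_agreement(values) for group, values in grouped.items()}


if __name__ == "__main__":
    # Hypothetical comparison results for one study: (item group, extractions agree?)
    records = [
        ("eligibility", True), ("eligibility", False),
        ("design", True), ("design", True), ("design", True), ("design", False),
    ]
    print(agreement_by_group(records))
    print(percent_agreement(agrees for _, agrees in records))
```

Items excluded from the comparison (for example, because data were missing from the original language extraction) would simply be left out of `records` in this layout.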

We first performed Mann-Whitney tests to compare the distribution of agreements across all extraction items for each foreign language (separately) and English language extraction. We repeated the same test for each category of items between each foreign language and English language. Based on the observed distribution of our reference standard (i.e., English language), we defined “good agreement” as greater than or equal to 80 percent agreement. We performed Fisher’s exact test to assess the differences in the percentage of items that reached “good agreement” between each foreign language and English language, across all categories, for each category of items, and for each language set of studies (the percentage of studies that had >80 percent agreement within each study).

Analyses were conducted with Stata SE 11 software (Stata Corp., College Station, Texas). All P values were 2-tailed, and a P value less than 0.05 was considered to indicate a statistically significant difference. We did not adjust for multiple testing. The researcher performing the comparisons also collected examples of obvious causes of disagreements between original language and English extractions.



* Initials in brackets refer to the study investigators (authors) or acknowledged colleagues.

These tables refer to the “tables” in the data extraction form, not the Tables in this report.

