Overview
We searched 18 databases (please see Appendix A for the search strategies) and retrieved 1,473 records. We identified 122 relevant publications. We provide a narrative synthesis of the literature below. Most of the literature we reviewed studied text mining within a single context in the systematic review process. Given the lack of overlap, we present our findings in their typical order within the systematic review process: searching, screening, data extraction, critical appraisal, and updating. Following the literature review section, we summarize the Key Informant interviews and the information from our review of individual text-mining tools.
Literature Review
Searching
Information retrieval is one of the earliest tasks within the systematic review process and has a profound impact on the review's comprehensiveness. A challenge librarians face is identifying the universe of concepts, text words, and controlled vocabulary terms relevant to the review topic. The search strategy's quality depends on the librarian's experience and skill. As Hausner et al. note, concept-based approaches are subjective and depend on the information specialist's knowledge of the topic under investigation. Given the complex nature of many topics, it is difficult to know when a strategy is complete.30
One way text mining is applied within the search stage of a systematic review is identification of keywords and controlled vocabulary terms for the search strategy. Typical strategy development involves exploratory searches followed by scrutiny of keywords and indexing by information specialists. Although effective, this process is time-consuming and limited by the librarian's understanding of the topics and controlled vocabularies. It is also difficult to capture this iterative process in the review documentation, which affects the transparency of the review process.
In our results set, the most common use of text mining in the systematic review search process was objective topical filter development.30-41 The specific topics are noted in Table 3. Although the topics studied are not directly related, they have two features in common: a complex and diffuse nature that is not well covered by the controlled vocabularies currently used to index bibliographic databases, and a multidisciplinary scope that requires searches of diverse resources to ensure comprehensive retrieval.
Table 3
Topical filters.
There were several general approaches to developing strategies in the literature we reviewed. The first approach assessed word frequency in citations using a stand-alone application. Tools such as PubReMiner provide a user interface to analyze PubMed output.42 The program generates frequency tables from the results set outlining the number of records by text word, controlled vocabulary heading, year, substances, country, etc. Balan,34 Kok,35 and Hausner30 used this approach in their studies. Tanon,33 Petrova,36 Hausner,30 and Poulter43 used EndNote, a citation management application, to generate word frequency lists. This technique is limited to words appearing in citation titles, abstracts, and controlled vocabulary terms.
The second approach is automated term extraction.6 This approach can also be used with citations, abstracts, and controlled vocabulary terms but is extensible to the full text of documents. These tools also generate word frequency tables, but many are limited to single-word occurrences. This limits their utility, since many controlled vocabulary terms are phrases and the relevance of single text words is best assessed in context. Tools such as Antconc, Concordance, and TerMine extract phrases and combination terms. Programs such as MetaMap and Leximancer add a semantic layer to the process by using tools provided through the National Library of Medicine's Unified Medical Language System. Words and phrases identified in the corpus are expanded through mapping to Metathesaurus concepts and may be clustered according to semantic relationships associated with the concepts.38
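To illustrate the kind of phrase-level frequency counting these tools perform, the following is a minimal Python sketch; the sample abstracts are invented, and real tools such as TerMine add part-of-speech filtering, stop-word handling, and statistical scoring rather than raw bigram counts.

```python
import re
from collections import Counter

def bigram_counts(texts):
    """Count two-word phrases across a set of titles/abstracts.

    A crude stand-in for phrase-extraction tools: real systems add
    linguistic filtering and statistical term-scoring on top of counting.
    """
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Invented example abstracts
abstracts = [
    "Heart failure patients were randomized to telemonitoring or usual care.",
    "Telemonitoring reduced readmissions in chronic heart failure.",
]
for phrase, n in bigram_counts(abstracts).most_common(5):
    print(" ".join(phrase), n)
```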
The tools described in the literature we reviewed (see Table 3, Table 4) employ different algorithms. However, the overall approaches were similar. The first step is creating a developmental set to train the text-mining application. Several methods were used to generate these sets. The most common was creating a corpus of included references from completed systematic reviews on the topic of interest.31,32 Variations of this approach included manually created sets based on author knowledge and curated bibliographies, reference sets from clinical practice guidelines, and PubMed click-through data.33,35-38,44 In addition to the training set, another corpus representing the general literature, usually created by randomly sampling citations from PubMed, is also presented to the algorithm. Only words and phrases that are “overrepresented” in the training set are considered for inclusion in the search strategy. For example, Simon et al. included terms from the development set that were prevalent in two percent or less of the population set.32
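A minimal sketch of the overrepresentation step just described, assuming the development (seed) corpus and the random-sample population corpus are available as lists of per-record term sets; the two-percent cut-off mirrors the threshold Simon et al. report, but the function and its names are illustrative rather than any study's actual code.

```python
from collections import Counter

def candidate_terms(dev_docs, pop_docs, max_pop_fraction=0.02):
    """Keep terms that are common in the development set of included
    references but rare in a random-sample population corpus.

    dev_docs, pop_docs: lists of per-record term sets (e.g., title/abstract
    words and controlled vocabulary terms).
    max_pop_fraction: illustrative cut-off; a term is kept only if it occurs
    in at most this fraction of population records.
    """
    dev_df = Counter(term for doc in dev_docs for term in doc)
    pop_df = Counter(term for doc in pop_docs for term in doc)
    n_pop = len(pop_docs)
    overrepresented = [
        term for term in dev_df
        if pop_df[term] / n_pop <= max_pop_fraction
    ]
    # Most frequent development-set terms first, for manual review
    return sorted(overrepresented, key=lambda t: -dev_df[t])
```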
Table 4
Searching.
This approach has inherent problems. Petrova et al. note that “the reported frequencies for text words did not necessarily reflect the number of abstracts in which a word appears. It is the latter that would be a true indicator of sensitivity.”36 Also, “the term extraction algorithm depends on the content of the documents supplied to it by the reviewer.”31
Most study groups took a diagnostic framework approach and reported the recall, precision, and number needed to read for their objectively derived strategies. Gold standard comparator groups were generated using PubMed HSR Queries, existing curated subject bibliographies, and strategies used to create existing systematic reviews.32,33
Study results, text-mining tools reported in the studies, and other data are presented in Table 4.
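For reference, the diagnostic-style metrics these studies report can be computed from simple counts, as in the sketch below; the example figures are invented.

```python
def search_metrics(relevant_retrieved, total_retrieved, relevant_in_gold_standard):
    """Recall, precision, and number needed to read (NNR) for a strategy."""
    recall = relevant_retrieved / relevant_in_gold_standard   # sensitivity
    precision = relevant_retrieved / total_retrieved          # positive predictive value
    nnr = total_retrieved / relevant_retrieved                # records screened per relevant record
    return recall, precision, nnr

# Invented example: 90 of 100 gold-standard records found among 3,000 retrieved
print(search_metrics(90, 3000, 100))  # (0.9, 0.03, 33.33...)
```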
All the studies represented in Table 3 found benefit in automating term selection for systematic reviews, especially reviews of large, unfocused topics. Balan et al. found that "the benefits of text-mining are increased speed, quality and reproducibility of text process boosted by rapid updates of the results."34 They also found that text-mining "revealed trends in big corpora of publications by extracting occurrence frequency and relationships of particular subtopics."34 Petrova et al. similarly noted, "Word frequency analysis has shown promising results and huge potential in the development of search strategies for identifying publications on health-related values. Other 'diffuse' topics, such as change (both in healthcare organizations and of health behaviors), communication, social support, learning, and teaching may also lend themselves to effective exploration for the purposes of search strategy design through these or similar techniques for the field of the health information sciences."36 In their studies, Hausner et al. proposed an objective approach to creating search strategies and validated that it was noninferior to the manual conceptual approach.30,46 Search documentation for reviews using the automated approach includes the word frequency tables and the seed references included in the test set. The group is currently running a prospective head-to-head study comparing these methods.
Text mining can be incorporated at various points in search strategy development. Although most of the literature describes identification of keywords for the strategy, Choong et al. suggest an automated text-mining approach to “snowballing.” “Snowballing” is the process in which relevant references cited in retrieved literature are added to the search results and usually is performed after the main literature search is completed. Choong et al. found that “Snowballing is automatable and can reduce the time and effort of evidence retrieval. It is possible to reliably extract reference lists from the text of scientific papers, find these citations in scientific search engines, and fetch the full text and/or abstract.”45
Although it seems promising, text mining has not become a standard tool for creating systematic review search strategies. Simon et al. note that “the described development process for an empirical search strategy is a useful – though technically demanding – approach to building performance-oriented strategies.”32 Balan et al. concluded, “Methodologically speaking, we conclude that text-mining was helpful in getting an overall perspective on a huge corpus of literature with some level of detail, intentionally limited to handle complexity. Richer information can be extracted using more complex text-mining methods focused on narrower topics, but this requires extensive training and knowledge.” They also commented that “A decision factor to use text-mining relates to how profitable and how difficult the tools may be.”34
One common limitation we observed in the literature was that many of the tools depend on output from PubMed/MEDLINE. Citations retrieved from this resource are important for systematic reviews but do not represent the entire population of literature relevant for health-care-related systematic reviews. Other limitations are related to the nature of the literature base itself. For example, extraction tools that do not use semantic expansion may miss relevant studies. Damarell et al. found that although their filter improved recall for heart failure–related topics, some studies were missed because they mentioned specific symptoms/syndromes rather than the underlying condition.37
Most authors recommend incorporating text-mining processes as an adjunct to employing experienced information professionals. O'Mara-Eves et al. conclude that text mining “should never be used on its own but rather in conjunction with the expertise and usual processes that are followed when developing a search strategy.”31 Interestingly, some authors argue that when an objective approach to text-mining is applied, further approaches such as obtaining expert knowledge or reading background literature may no longer be necessary to develop reliable search strategies.30,46
Screening
After searching, the next step in the systematic review process is screening the retrieved citations for relevancy to the research questions. This requires analysts to review each retrieved item and compare it to a predetermined list of inclusion and exclusion criteria. The full text of included citations is obtained for further review, data abstraction, and analysis.
O'Mara-Eves et al. published a systematic review on this topic in January 2015.14 Because of this review's currency and comprehensiveness, we are using it as the basis for our review of text mining in the systematic review screening process.
The O'Mara-Eves et al. review comprises 44 studies (27 retrospective studies, 17 prospective studies). Across these studies, text mining was incorporated into the screening process for multiple purposes. One major use was prioritizing citations for manual screening. This had the advantage of human review of all the citations, but it provided efficiencies by presenting the most relevant citations first. This concentrated the document-retrieval activity earlier in the process so data abstraction could proceed in tandem with review of the machine-designated “less relevant” citations. Some programs used visualization methods to group “like” citations. This allowed researchers to more rapidly assess the citation groups and make inclusion/exclusion decisions. Another variation on this method was rating the difficulty of screening individual citations. More challenging citations would be assigned to more experienced researchers, again speeding the overall process.
Some studies reported using text-mining techniques for automated citation inclusion/exclusion decisions. Most commonly, the automated screening would fulfill the role of second screener to meet recommendations for dual screening of citations.
As mentioned in the searching section of this report, text mining is highly dependent on the set of citations used to train the algorithm. O'Mara-Eves et al. define active learning as “an iterative process whereby the accuracy of the predictions made by the machine is improved through interaction with reviewers.”14
Creating training sets for systematic review screening presents challenges not present in other text-mining use cases. Because of the comprehensive nature of systematic reviews, search retrieval tends to include many more irrelevant than relevant citations, leading to "imbalanced datasets." This problem has been addressed in several ways. One approach is assigning greater weight to included citations than to excluded citations in the training algorithm. Another approach is under-sampling, which can be done randomly or aggressively. Aggressive under-sampling ranks excluded citations by their similarity to included citations and removes the most similar ones from the training set, skewing the remaining set. This biases the classifier toward including equivocal citations and helps prevent false-negative exclusions.
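A minimal sketch of aggressive under-sampling as just described, assuming citations have already been converted to feature vectors; the cosine-similarity ranking and the keep fraction are illustrative choices, not a specific study's implementation.

```python
import numpy as np

def aggressive_undersample(X_included, X_excluded, keep_fraction=0.5):
    """Drop the excluded citations most similar to the included ones.

    X_included, X_excluded: 2-D arrays of citation feature vectors.
    Returns the subset of excluded citations least similar (by cosine
    similarity) to the centroid of the includes, so a classifier trained
    on the remainder leans toward including borderline citations.
    """
    centroid = X_included.mean(axis=0)
    centroid /= np.linalg.norm(centroid) + 1e-12
    norms = np.linalg.norm(X_excluded, axis=1) + 1e-12
    similarity = X_excluded @ centroid / norms
    keep = np.argsort(similarity)[: int(len(X_excluded) * keep_fraction)]
    return X_excluded[keep]
```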
False negatives (deeming a citation irrelevant when it should have been included in the review) are more problematic than false positives, since falsely included publications can be excluded at the full article review stage. One method of managing this problem is implementing "voting or committee approaches for ensuring high recall."14 This can be implemented by running multiple classifiers simultaneously and counting the "votes" for inclusion or exclusion. Disputed items can be forwarded for manual review. Another approach is including the citation if any classifier recommends inclusion. O'Mara-Eves et al. note that implementers of text-mining algorithms should "consider whether the amount and/or quality of the training data make a difference to the ability [of] these modifications to adequately penalize false negatives. The reason for this is that, if used in a 'live' review, there might be only a small number of human-labelled items in the training set to be able to determine whether the classifier has incorrectly rejected a relevant study."14
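A sketch of the committee idea, assuming several already-trained classifiers with scikit-learn-style predict methods returning 1 for include and 0 for exclude; both the inclusive "any vote" rule and a majority rule are shown, and the function is a placeholder rather than any published system's code.

```python
import numpy as np

def committee_screen(classifiers, X, rule="any"):
    """Combine inclusion votes (1 = include, 0 = exclude) from a committee.

    rule="any": include a citation if any classifier votes include, which
    favors recall and guards against false-negative exclusions.
    rule="majority": include when more than half of the committee agrees;
    disputed citations could instead be routed to manual review.
    """
    votes = np.array([clf.predict(X) for clf in classifiers])  # (n_classifiers, n_citations)
    if rule == "any":
        return votes.max(axis=0)
    return (votes.sum(axis=0) * 2 > len(classifiers)).astype(int)
```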
Another training problem is that a set of citations may not be representative of the entire population of relevant documents. This imposes a risk of “hasty generalization.” Processes recommended to avoid this problem are incorporating reviewer domain knowledge and employing patient active learning. In this approach, classifiers are targeted on different “views” of the citations such as titles, abstracts, and controlled vocabulary terms. O'Mara-Eves et al. noted that human input resulted in a decline in recall when active learning was added to a support vector machine or decision tree classifier but made no difference to the recall of a naïve Bayes classifier. They found this intriguing and recommend further research in this area.14
Creating training sets for systematic review updates presents unique problems. Although it may seem an easier task because there is already a set of included citations for training the algorithm, concept drift may have occurred. Concept drift is a phenomenon in which “data from the original review may cease to be a reliable indicator of what should be included in the new one.”14 Training sets might not be representative of those available when conducting a “new” review. Also, biases may have been introduced by overly inclusive reviewers for the report's previous iteration.
Where possible, the 44 studies in the O'Mara-Eves et al. systematic review were evaluated for workload reduction. The authors evaluated the algorithms or text-mining methods employed in the included studies. Within this umbrella are classifiers and the options for using them (kernels) and feature selection for the algorithms (titles, abstracts, MeSH headings), including the effect of different combinations on performance. They also evaluated the effectiveness of methods for implementing text mining. The evaluation metrics used include the F measure (harmonic mean of precision [positive predictive value] and recall [sensitivity]), work saved over sampling (WSS), and utility. Reported evaluation metrics had subjective elements, which made it difficult to compare across studies. Individual study results are available in the O'Mara-Eves systematic review.14 Almost all papers considered text mining a promising method to reduce workload during screening.
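As a point of reference, the F measure and WSS can be written out explicitly; the WSS formulation below is the commonly used one (work saved relative to random sampling at the achieved recall), though individual studies vary in the details, and the example numbers are invented.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta = 1)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def work_saved_over_sampling(tp, fp, tn, fn):
    """WSS: the fraction of records a reviewer avoids screening, relative
    to random sampling, at the recall level the classifier achieves."""
    n = tp + fp + tn + fn
    recall = tp / (tp + fn)
    return (tn + fn) / n - (1 - recall)

print(f_measure(0.03, 0.9))                        # low precision drags the F measure down
print(work_saved_over_sampling(95, 900, 4000, 5))  # recall 0.95 on an invented 5,000-record set
```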
O'Mara-Eves et al. suggested elements for consideration before broadly implementing text mining. First, the program should be available to systematic reviewers without the need for a computer scientist to write code or process text for individual reviews. At the time of fact checking, the authors identified only six such systems:
- Off-the-shelf for systematic review:
  - Abstrackr
  - EPPI-Reviewer
  - GAPScreener
  - Revis
- Generic (require some training):
  - Pimiento
  - RapidMiner
Replicability, scalability, and suitability should also be considered. Only one study reported in the review was a replication study. Although some studies used the same dataset, it was impossible to directly compare the studies. Scalability is still questionable. The evaluation datasets were relatively small compared with typical systematic review retrieval sets. With few exceptions, most datasets included fewer than 5,000 citations. Suitability also requires additional study. Only a few types of evidence bases have been evaluated to date, mostly in the domains of biomedicine and software engineering.
O'Mara-Eves et al. conclude, “On the whole, most [studies] suggested that a saving in workload of between 30% and 70% might be possible (with some a little higher or a little lower than this), though sometimes the saving in workload is accompanied by the loss of 5% of relevant studies (i.e., a 95% recall).”14 They noted that the approaches so far have been based on citations, abstracts, and metadata rather than full text. They recommend:
- Systematic reviewers should work together across disciplines to test these approaches.
- Text mining for prioritization is ready for implementation.
- Text mining as a second screener may be used cautiously.
- Text mining as the only means of excluding articles is not yet ready for use.
One of the tools the O'Mara-Eves review mentions is worthy of additional discussion since it was developed by members of an EPC. The Abstrackr development team has multiple publications tracking the evolution of Abstrackr.49-51 Abstrackr uses a semi-automated screening algorithm that incorporates labeled terms and timing data into an active learning framework. The algorithm was developed for imbalanced datasets and is intended as an add-on to manual processes. It takes a pool-based active learning approach built on the LibSVM support vector machine. The SIMPLE active learning strategy trains the algorithm by presenting the most ambiguous citations for labeling first. It continues presenting citations until a predefined stopping criterion is met. After experimentation, the developers selected 50% as the cut-off point. As of 2012, the developers had used Abstrackr in more than 50 systematic reviews.
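Presenting the most ambiguous citations first amounts to uncertainty sampling from the unlabeled pool. The following is a minimal sketch of that idea using scikit-learn's libsvm-backed SVC; the vectorization, batch size, and absence of a stopping rule are simplifications, not Abstrackr's actual implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def most_ambiguous(labeled_texts, labels, pool_texts, batch_size=10):
    """Rank unlabeled pool citations by ambiguity (distance to the SVM
    decision boundary) and return the indices of the most ambiguous ones,
    which are the next candidates to present for manual labeling."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X_labeled = vectorizer.fit_transform(labeled_texts)
    X_pool = vectorizer.transform(pool_texts)
    clf = SVC(kernel="linear").fit(X_labeled, labels)
    margins = np.abs(clf.decision_function(X_pool))  # small margin = ambiguous
    return np.argsort(margins)[:batch_size]
```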
Rathbone et al., at the Centre for Research in Evidence-based Practice in Australia, have also studied Abstrackr. Their study included four systematic reviews representative of different types of evidence bases (diagnostics, multiple-intervention, small homogenous set, multiple study types), and their metric was workload savings. The authors chose Abstrackr for evaluation over other text-mining tools because “existing literature indicates that the recall accuracy of Abstrackr is very high… and therefore, a promising predictive text-mining tool for systematic reviews where the primary goal is to identify all relevant studies.”52 The authors conclude that “Semi-automated screening with Abstrackr can potentially expedite the title and abstract screening phase of a systematic review. Although the accuracy is very high, relying solely on its predictions when used as a stand-alone tool is not yet possible. Nevertheless, efficiencies could still be attained by using Abstrackr as the second reviewer thereby saving time and resources.”52
Data Extraction
After the full-text articles have been retrieved and the inclusion decision verified, members of the systematic review team begin extracting data elements relevant for their review topic. Since data abstraction is a form of information extraction, this process has also been studied in the context of text mining.
Information extraction can include named entity recognition (concept extraction) and association (relationship) extraction. Jonnalagadda et al. published a systematic review focused on automating data extraction in systematic reviews in June 2015.15 This section will focus mainly on this work, with the addition of several studies that may be of specific interest to the AHRQ EPCs.
The Jonnalagadda et al. review comprises 26 studies. The authors created a table of extracted elements as identified in several systematic review standards and determined which elements had been extracted in the studies they reviewed. The “standards” include:
- Cochrane Handbook for Systematic Reviews
- PICO (Population, Intervention, Comparison, Outcomes Framework)
- PECODR (Patient-Population-Problem, Exposure-Intervention, Comparison, Outcome Duration and Results Framework)
- PIBOSO (Population, Intervention, Background, Outcome, Study Design, Other Framework)
- STARD (Standards for Reporting of Diagnostic Accuracy initiative)
- CONSORT (The Consolidated Standards of Reporting Trials)
Various studies had extracted population-related elements, including the total number of participants, demographic information (age, ethnicity, nationality, sex), and condition-related elements such as comorbidity and spectrum of presenting symptoms. Intervention-related elements included specific interventions, intervention details, total number of intervention groups, and current treatments for the condition. Outcomes-related information included both collected and reported outcomes and time points. Additional elements included:
- Comparators
- Sample size
- Overall evidence
- Generalizability
- External validity
- Research questions and hypotheses
- Study design
- Total study duration
- Sequence generation
- Allocation sequence concealment
- Blinding
- Methods for generating allocation sequence and implementation
- Key conclusions of study authors.
The Jonnalagadda et al. review lists an additional 28 elements, which have not yet been the subject of data extraction studies.
The accuracy of results was measured with the F metric. Studies reported data abstraction at the sentence, abstract, and full-text levels using a variety of approaches, including:
- Conditional random fields (lexical, syntactic, structural, sequential data)
- Multiple supervised classification techniques using features such as MeSH semantic type, word overlap with the title, and punctuation marks, with classifiers including random forests, naïve Bayes, support vector machines, and multi-layer perceptrons
- Naïve Bayes classifier and structured abstracts
- Statistical relational learning-based approach (kLog)
- Multistep processes:
  - Inferring latent topics from documents, then using logistic regression to determine the probability that a criterion belongs to a topic
  - SVM classifier identification of sentences followed by manually crafted extraction rules
In many studies, the elements were identified and highlighted but not extracted from their context.
The F-score varied greatly between element types and studies. The review authors were unable to compare between studies because of the heterogeneity in data sets and methods. They conclude: "Most of the data elements that would need to be considered for systematic reviews have been insufficiently explored to date, which identifies a major scope for future work." They suggest that automated data extraction might initially be used to validate single-reviewer manual data extraction, with fully automated extraction following as the technology evolves.15
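One of the simpler approaches listed above, a naïve Bayes classifier over sentences, can be sketched as follows; the training sentences and labels are invented placeholders, and published systems add richer features such as section headings and MeSH semantic types.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented, toy-scale training sentences labeled with the data element they report
sentences = [
    "A total of 250 patients were enrolled across four sites.",
    "Participants were randomized to metformin or placebo.",
    "The primary outcome was HbA1c at 26 weeks.",
]
labels = ["population", "intervention", "outcome"]

classifier = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
classifier.fit(sentences, labels)

# Predict the element type of an unseen sentence (highlighting, not extraction)
print(classifier.predict(["Subjects received 40 mg of atorvastatin daily."]))
```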
Specific Examples
Multiple studies addressed the use of text mining for detecting bias. Marshall et al. discuss use of the RobotReviewer tool to assess risk of bias using domains defined by the Cochrane Risk of Bias Tool.53 They used a multitask support vector machine approach that purportedly exploits correlations between bias types. Their tool gauges whether a report is at a low risk of bias and extracts supporting statements. They used a nontypical approach to create their training set. Rather than using a curated set, they used structured data in existing databases within the Cochrane Library. They selected the Cochrane databases because they are rich in terms for bias assessment. The authors concluded that the tool was not ready to replace human review but could help prioritize assessment and could potentially allow review by one analyst rather than two. A major limitation is that the tool provides one risk-of-bias assessment for the paper rather than an assessment for each outcome.
Hsu et al. focused on extracting statistical analyses using their automated sequence annotation pipeline (ASAP).54 Their approach incorporates three annotators (concept, statistics, and clinical trials) and requires full-text articles. ASAP runs all three annotators concurrently. The authors found that it was inconsistent in capturing variability in independent and dependent variables and hypothesis testing. They conclude: "Our system is a step towards automating the identification of key reported statistical findings that would contribute to the development of a Bayesian model of a complex disease."54 They note that while tools exist to extract data from the abstract, only a small portion of the relevant information is reported there. They also noted that these tools do not provide the context necessary to interpret the extracted information. "We attempt to not only classify sentences related to the statistical analyses, but also characterize the values reported in these sentences to populate the data model. This allows the computer to assist in assessing the validity of reported information and enables this information to be used for meta-analysis and probabilistic disease modeling."54 ASAP coordinates published study information with information from the protocol in the ClinicalTrials.gov record.
Shao et al. also used ClinicalTrials.gov data to address bias.55 Their Aggregator clustering tool is designed to detect multiple publications derived from the same trial. Their study was based on a set of MEDLINE articles containing one or more National Clinical Trial (NCT) registry numbers. There were two training sets. The positive set comprised articles with the same NCT numbers, while the balanced negative set comprised articles with the same conditions and interventions but different NCT numbers. The classifier used multiple features, including:
- Rank in related articles
- Number of shared author names
- Affiliation similarity
- Shared email
- Publication type similarity
- Support type similarity (grants)
- Email domain
- Shared country
- Shared substance names
- All-capitalized words in title
- All-capitalized words in abstract or CN field
The training set reached 0.881 precision and 0.813 recall, with an F1 of 0.843. The validation set (composed of citations from five Drug Effectiveness Review Project systematic reviews) reached 0.877 precision and 0.833 recall, with an F1 of 0.854. They encountered two types of errors: splitting errors, in which the model missed articles from the same trial that discussed different aspects of the study, and lumping errors, in which the model incorrectly grouped publications that shared authors and topics but reported different trials. Although still a model, the authors plan to incorporate it into their pipeline tool.55
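To make the feature list above concrete, here is a hedged sketch of turning a pair of citation records into a handful of same-trial features; the record fields and scoring are hypothetical stand-ins for some of the listed features, not Shao et al.'s implementation, and the resulting dictionary would feed a standard classifier.

```python
def pair_features(a, b):
    """Build a few same-trial features for a pair of citation records.

    a and b are dicts with hypothetical fields: 'authors', 'emails', and
    'substances' are sets; 'country' is a string. These mirror some of the
    features listed above; a real system derives them from MEDLINE fields.
    """
    def jaccard(x, y):
        union = x | y
        return len(x & y) / len(union) if union else 0.0

    return {
        "shared_authors": len(a["authors"] & b["authors"]),
        "shared_email": int(bool(a["emails"] & b["emails"])),
        "same_country": int(a["country"] == b["country"]),
        "substance_similarity": jaccard(a["substances"], b["substances"]),
    }
```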
Cohen et al. describe the randomized controlled trial (RCT) tagger.56 This tool predicts whether a study is an RCT based on the citation, abstract, and MeSH headings. The model can be used with or without the MeSH headings. RCT tagger is a web-based tool (http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/RCT_Tagger.cgi) and returns a list of abstracts from PubMed with an RCT confidence rating. It shares the pipeline with Aggregator so the RCTs can be further analyzed for trial source within the same search session. Cohen et al. found that many RCT citations were not classified with the RCT tag and vice versa.
Updating
After a systematic review has been completed and published, one major challenge is determining whether changes in the evidence base necessitate updating the report. Cohen et al. have addressed this problem in 2008 and 2012 publications.57,58 The authors envision an automated alerting system that notifies a team that a study likely to meet inclusion criteria has been published as soon as that publication has been indexed in Medline. They found that “review experts are more willing to trade off recall for precision for the New Update Alert task, as compared to the work prioritization task that we have previously studied. In particular, the principal investigator of the Drug Effectiveness Review Project (one of the senior authors of this paper) consistently preferred a recall of 0.55 and the achievable precision corresponding to that level of recall over all other available levels of recall between 0.99 and 0.55.” Although recall consistently reached 0.55 in the training set, it varied from 0.134 to 1.0 in the test topics. The authors attributed this to small sample sizes in the test sets; the largest topic set achieved recall of 0.50. They believe that a systematic review expert using a live alert system could use this approach effectively. They also note that it could be useful for prioritization of review updates between topics since it would facilitate comparing the number of citations that meet alert criteria as opposed to the gross number of citations captured in the search alerts.
Dalal et al. also considered text mining for report updating.59 Their training set retrieved only PubMed citations that had been indexed with MeSH headings for simulated comparative-effectiveness research reports. The authors “evaluated statistical classifiers that used previous classification decisions and explanatory variables derived from MEDLINE indexing terms to predict inclusion decisions. This pilot system reduced workload associated with screening two simulated comparative effectiveness review updates by more than fifty percent with minimal loss of relevant articles.”59
Key Informant Interviews
We interviewed eight Key Informants (KIs). To get a range of views on the implementation and use of text mining, we set out to interview two different groups: four senior investigators representing an organizational perspective and four librarians for a research team member perspective. Because the preponderance of published literature to date has focused on the use of text mining in the screening phase, our workgroup decided to focus on the searching phase and to interview librarians to gain a fresh perspective. Below, we provide a narrative summary of integrated findings for each group. Please see Appendix C for question themes and quotes from each group and Appendix D for KI comments on specific text-mining tools.
Summary of Integrated Findings
Senior Investigator/Organizational Perspective
Motivation to use text-mining tools: Most of the KIs were interested in process improvement but came to it from different angles (e.g., medical versus social science topics, systematic review versus scoping review), with most of the comments relating to the use of text mining in the screening process to overcome problems associated with large result sets.
Cost and time efficiencies: Software and staffing costs to create a text-mining tool seemed difficult for the interviewed KIs to calculate because many of the tools they use have been developed over a long time, working alone or in conjunction with colleagues and existing staff. Two of the KIs use text-mining tools to prioritize records for screening, so records with the highest probability of being on-topic are shown at the beginning of the result set; thus, research team members can begin abstracting and analyzing included records earlier in the systematic review process than would be typical if research teams had to wait for the screening process to be completed first. One KI was involved in a large-scale scoping review in which text mining allowed their team to complete a project that would otherwise have proved impossible with a traditional screening approach.
“Integratability” into existing workflows was mentioned by two KIs, both in terms of making it easier for staff to complete a task without moving between multiple systems and of creating a user-friendly front-end interface, because many text-mining tools otherwise require some technological expertise to run.
Organizational and technological facilitators: Perhaps not surprisingly, organizational leadership seemed to be the critical factor in the decision to move forward with implementation. Information technology (IT) infrastructure and IT staff support varied across organizations. Three KIs noted its importance to the success of their projects, while the fourth KI lamented that no specific organizational budget line existed to aid its development.
Organizational and technological barriers: KIs mostly reported high staff acceptance of text mining; however, staff were also mentioned as an organizational barrier, specifically librarians/information specialists who may feel their work has been deskilled. Two KIs expressed concerns regarding the systematic review community's reception of text-mining/machine-learning use to support reviews, given a preference for human decision-making over computers. While KIs were generally optimistic about the future integration of text-mining tools into systematic reviews, two expressed some hesitation: at present the tools should not be used blindly, but rather with knowledge of their strengths and limitations, because they are still in their developmental infancy.
Areas mentioned as needing more research and guidance included:
- Developing time and accuracy metrics to allow formal evaluation
- Developing metrics to evaluate the value of text mining
Reporting of text-mining tool use in the final review varied, with most KIs citing a lack of standards as problematic to transparently conveying what was done.
Librarian/Research Team Member Perspective
Motivation to use text-mining tools: All the librarians cited objectivity as one of their prime motivations to use text-mining tools to develop systematic review search strategies in a more rigorous manner. Generally, they were confident that the resulting keywords and synonyms were more comprehensive, and faster and easier to identify, than those found through the typical iterative approach of reviewing titles and abstracts. One KI also mentioned that text mining brings out more subtle aspects of how a topic is (non-obviously) connected to other topics (e.g., how diabetes is implicated in several other childhood conditions that could/should also be searched for a review). Overall, the KIs noted that the time required to create the search strategy was decreased, and confidence in the resulting list of keywords/synonyms was high. While all of them recommended using text-mining tools, they also expressed a variety of qualifications:
- Some tools, like VOSviewer, while presenting intriguing visualization results, need to be more carefully assessed to determine how they can best be used to improve a search.
- The tools are useful for creating a search strategy but are not the be-all and end-all.
- It is difficult to evaluate whether the corpus used to train a filter was truly representative of the material it was developed to find.
- Complex topics (e.g., health services research) may be better suited to using text-mining approaches, whereas using “traditional” keyword/subject searches might be more suitable for straightforward one-drug/one-indication topics.
Keyword/synonym tools: Some of the tools, especially the keyword/synonym-type tools like PubReMiner and GoPubMed, are easy to use and can be learned quickly. Integration with other databases or software is often not as seamless as desired because files may need to be reformatted to get data into or out of the tools. Some tools, like PubReMiner, are easy to work with while others are not. In addition to generating lists of terms to use in a search, some KIs noted these tools can also be helpful in identifying terms that can be excluded from the result set. Please see Appendix C for more comments on specific text-mining tools.
Filter tools: Most of the librarian comments focused on the keyword/synonym-type tools rather than filter-type tools. One KI mentioned that, because of the sensitivity of filters and the greater retrieval of records, it is sometimes more efficient to approach the search via the “traditional” keyword/subject approach, which does not take as long to develop or to screen. The published literature has more articles on filter development, so this focus on keyword/synonym tools was unanticipated and bears further scrutiny in future research to better understand which types of tools are more useful and under what conditions (e.g., straightforward questions versus complex questions, systematic review organization versus one-off research project).
IT environment: KIs generally had few problems using text-mining software online or installing it locally when necessary; however, some issues did arise with an organizational server's security settings for one KI. Depending on local institutional IT risk tolerance, accessing or downloading programs seems likely to be an issue for some research teams wanting to use these tools. One KI commented that using more complicated tools like General Architecture for Text Engineering (GATE) and VOSviewer required the help of IT staff to run correctly, though most KIs found they could run most of the keyword/synonym-type tools with no assistance.
Reporting the use of tools to develop the search strategy varied, from not reporting them at all to reporting the use of filter-type tools but not keyword/synonym tools. KIs who do not report keyword/synonym tools noted that other “background” methods used to develop keywords and synonyms are not typically reported and considered these tools a similar case. Performance evaluation, much less comparative performance evaluation, of these tools has not yet been researched, so it is not yet known whether using one tool or another may bias strategy development.
Identification of Individual Text-Mining Tools
We cast a broad net to identify text-mining tools and applications and retrieved many. We assessed 111 text-mining tools. To provide a meaningful summary, we narrowed the retrieval to a subset of tools that met one or more of the following criteria: (1) described in the literature and deemed as relevant or useful to a systematic review; (2) used and reported as a methodologic resource in a systematic review publication; or (3) mentioned by a KI during an interview (see Appendix D). In addition, we expanded the list to include those tools that met all of the following: (1) free and Web-based (i.e., not requiring download or license); (2) high likelihood of relevance to one or more steps in the systematic review process; and (3) high degree of confidence in the tool's stability and usability (i.e., a reliable connection, existence of help documentation, and/or literature references).
The findings from our preliminary assessment of tools (Appendix E) suggest that 73 (66%) were referenced in the literature captured by the literature review and 19 (17%) were identified by KIs (Table 5). Most of the tools (79%) we examined were available without cost via the web or through download of open-source code. Some tools mentioned in papers published just a few years ago were no longer supported or functional. Fourteen resources were unavailable, retired, or nonfunctional at the time of our assessment.
Table 5
Summary of tools identified by key informants.
We designated 89 of the 111 tools as potentially applicable or useful to the conduct of systematic reviews (i.e., designed to support, or easily modified to support, one of the core steps of a systematic literature review). Of those we were able to test (i.e., tools that did not require download or installation), 64 (57%) included a feature to support one or more of the key steps in the systematic literature review process. Most tools (n=52) supported searching, 44 supported scoping, 15 supported the screening process, and 14 aided information extraction.