Hartling L, Bond K, Harvey K, et al. Developing and Testing a Tool for the Classification of Study Designs in Systematic Reviews of Interventions and Exposures [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2010 Dec.

3. Results

Identification of Taxonomies (Objective 1)

We contacted 31 organizations or individuals to identify taxonomies/study design classification instruments for further evaluation. Figure 1 shows the number of contacts made and the responses received.

Figure 1. Identification of study design classification tools.

The Steering Committee reviewed the 23 tools that were received; 10 were considered relevant for the purposes of this project. Tables 1 and 2 describe the tools received.

Table 1. Study design classification tools selected for further evaluation.

Table 2. Description of study design classification tools rejected for further evaluation.

Selection of Taxonomies (Objective 2)

Five members of the Steering Committee (DD, LS, MV, KB, LH) independently reviewed and ranked the 10 tools based on criteria presented in Chapter 2. Table 3 provides the results of the ranking and observations of the tools made during the process.

Table 3. Steering committee rankings of tools from most (1) to least (10) preferred.

The three top-ranked tools were the “design algorithm for studies of health care interventions” (DASHCI) developed by the Cochrane Non-Randomised Studies Methods Group (NRSMG; note that the NRSMG no longer advocates this tool) and the tools developed by the American Dietetic Association (ADA) and by RTI-UNC (Appendix H). All three were algorithms, i.e., they provided a logical sequence of “yes or no” decisions to make when classifying studies. The tools featured two starting points for design classification: (1) the assignment of the intervention/exposure and (2) the number of comparison groups. None of the taxonomies covered the full range of study designs that systematic reviewers might encounter when conducting EPC evidence reports. Further, the nomenclature was inconsistent among the algorithms.

Because no single tool was judged to meet the needs of systematic reviewers in terms of the comprehensiveness of study designs covered, the Steering Committee decided to select one tool and to incorporate desirable features from the others. The DASHCI tool emerged as the most preferred design algorithm and was used as the basis for further development (Table 1).

Taxonomy Testing and Development of Reference Standard (Objective 4)

Reference Standard

Three members of the Steering Committee (DD, LS, MV) developed the reference standard by independently applying the flow diagram to the 30 studies. Disagreements were resolved through discussion and consensus. All three reviewers had doctoral-level training in epidemiology or research design and 4 to 8 years of experience in systematic reviews and/or EPC work.

The three reviewers agreed on the classification of seven studies (23 percent), two of the three agreed on the classification of 14 (47 percent), and there was no agreement on the classification of nine (30 percent). The overall agreement was fair (κ=0.33).

Disagreements occurred at most decision points in the algorithm, with the exception of the following three: (1) “Were at least three measurements made before and after intervention/exposure?” (2) “Was intervention/exposure data registered prior to disease?” and (3) “Were both exposure/intervention and outcome assessed prospectively?” Each of these decision nodes was the last in its respective branch of the flow diagram.

The area that created the greatest confusion and disagreement for the reference standard raters was the decision node “Was there a single cohort?” Specifically, it was often difficult to determine whether the two groups under study were derived from the same cohort, and the tool did not provide any criteria for making this decision. A second decision node where disagreements occurred was the first in the flow diagram: “Was there a comparison?” Specifically, it was unclear whether to classify the study as having a comparison when subgroup analyses were performed within a single group. A third point of disagreement was determining when a study was an interrupted time series (i.e., measurements taken at a minimum of three timepoints before and three timepoints after the intervention). While there is a precedent for this definition (http://www.epoc.cochrane.org/Files/Website/Reviewer%20Resources/inttime.pdf), the number of required timepoints may not be universally accepted.
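
To make the preceding discussion concrete, the following is a minimal, hypothetical sketch of how a few of the decision nodes named above (“Was there a comparison?”, “Was there a single cohort?”, and the three-timepoint interrupted time series criterion) could be encoded as a sequence of “yes or no” decisions. It is not the DASHCI flow diagram; the node order, design labels, and data fields are illustrative assumptions only.

    # Illustrative sketch only: a few of the yes/no decision nodes discussed
    # above, encoded as a nested decision function. This is NOT the actual
    # DASHCI flow diagram; node order, design labels, and the Study fields
    # are hypothetical.
    from dataclasses import dataclass


    @dataclass
    class Study:
        has_comparison: bool    # "Was there a comparison?"
        single_cohort: bool     # "Was there a single cohort?"
        pre_measurements: int   # measurements before the intervention/exposure
        post_measurements: int  # measurements after the intervention/exposure


    def classify(study: Study) -> str:
        """Walk a toy sequence of yes/no decisions to a design label."""
        if not study.has_comparison:
            # No comparison group: check for an interrupted time series
            # (at least three measurements before and after the
            # intervention/exposure, per the EPOC-style definition cited above).
            if study.pre_measurements >= 3 and study.post_measurements >= 3:
                return "interrupted time series (no comparison)"
            return "noncomparative study"
        if study.single_cohort:
            return "single-cohort (before-after) design"
        return "comparative design (further decision nodes required)"


    # Example: a study with no comparison group and repeated measurements
    # before and after the exposure.
    print(classify(Study(has_comparison=False, single_cohort=True,
                         pre_measurements=3, post_measurements=4)))
    # -> interrupted time series (no comparison)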

Test 1

Tester characteristics. Six staff members at the UAEPC tested the modified classification tool. These individuals had varying levels of relevant training, experience with systematic reviews in general, and experience with EPC work specifically. The length of time they had worked with the UAEPC ranged from 9 months to 9 years. Three of the testers had obtained a master’s degree in public health or epidemiology and three testers were undertaking graduate level training in epidemiology or library and information sciences.

The time taken to classify the 30 studies ranged from 7 to 9 hours with a mean of 8 hours overall and 16 minutes per study. Since the tool was new to the testers, this time reflects, in part, the process of familiarizing themselves with the flow diagram and the accompanying definitions.

Agreement. There were no studies for which all six testers agreed on the classification (Table 4). Five of six testers agreed on seven studies, four agreed on five studies, three agreed on nine studies, and two agreed on eight studies. The overall level of agreement was considered fair (κ=0.26) (see Table 5 for interpretation of the Fleiss’ kappa statistic). The degree of agreement for testers who had completed graduate level training was fair (κ=0.38), while for testers undertaking graduate training it was slight (κ=0.17).

Table 4. Results of testing.

Table 5. Interpretation of Fleiss’ kappa (κ) (from Landis and Koch 1977).
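
For readers unfamiliar with the statistic, the following is a minimal sketch of how Fleiss’ kappa can be computed for multiple raters assigning each study to one design category, together with the Landis and Koch (1977) descriptive bands referenced in Table 5. The example ratings and category labels are invented for illustration and are not the study data.

    # Minimal sketch: Fleiss' kappa for multiple raters, plus the
    # Landis and Koch (1977) interpretation bands. Example data are invented.
    from collections import Counter


    def fleiss_kappa(ratings):
        """ratings[i] lists the category each rater assigned to study i."""
        n_raters = len(ratings[0])
        categories = sorted({c for row in ratings for c in row})
        # n_ij: how many raters put study i into category j
        counts = [[Counter(row)[c] for c in categories] for row in ratings]
        n_studies = len(counts)

        # Per-study agreement P_i and overall category proportions p_j
        p_i = [(sum(n * n for n in row) - n_raters)
               / (n_raters * (n_raters - 1)) for row in counts]
        p_j = [sum(row[j] for row in counts) / (n_studies * n_raters)
               for j in range(len(categories))]

        p_bar = sum(p_i) / n_studies   # observed agreement
        p_e = sum(p * p for p in p_j)  # agreement expected by chance
        return (p_bar - p_e) / (1 - p_e)


    def interpret(kappa):
        """Descriptive bands from Landis and Koch (1977)."""
        label = "poor"  # kappa below 0
        for cutoff, name in [(0.0, "slight"), (0.21, "fair"), (0.41, "moderate"),
                             (0.61, "substantial"), (0.81, "almost perfect")]:
            if kappa >= cutoff:
                label = name
        return label


    # Hypothetical ratings: four studies, three raters each.
    example = [["RCT", "RCT", "RCT"],
               ["prospective cohort", "prospective cohort", "case-control"],
               ["ITS", "before-after", "ITS"],
               ["RCT", "prospective cohort", "case-control"]]
    k = fleiss_kappa(example)
    print(f"kappa = {k:.2f} ({interpret(k)})")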

We examined the disagreements in design classification and no clear patterns emerged (Table 6). Disagreements occurred at all decision points in the taxonomy. One decision point, “Was there a single cohort?”, emerged as particularly problematic. The following terminology and contrasts used in the flow diagram were described as unclear or confusing: “group”, “group” vs. “cohort”, “concurrently”, “comparison”, and “exposure” vs. “intervention.”

Table 6. Classification of studies: Round 1.

The testers were asked whether they thought the source of disagreement was the tool, the studies, or both. One tester thought the tool was good and two felt it was easy to use; however, the testers generally remarked that disagreements arose from poor reporting at the study level. For example, it was often unclear whether a study was prospective or retrospective; in fact, one study was described as retrospective in the Abstract and prospective in the Methods section. One tester commented that the variety of topics covered by the 30 studies made classification challenging, and that the tool may be easier to apply in the context of a systematic review in which studies are more similar in terms of topic and design issues.

Four of the testers commented that they were often influenced by their own impression of what the study design was. For instance, some testers said that they would read the study, form their own sense of the design, and work backwards through the flow diagram to justify that selection. Alternatively, they would work through the flow diagram to a design endpoint, check the definition to ensure that it was consistent with their own interpretation, and then work backwards to the decision node that would take them to the design they thought was more appropriate.

Finally, two testers indicated that they did not use the definitions that accompanied the flow diagram, while two testers said that the definitions helped them make decisions and make sense of the letter codes. Several testers indicated that they preferred design labels on the flow diagram rather than letter codes.

Accuracy of testers compared to reference standard. The accuracy of the testers was assessed against the reference standard (Table 7). There were no studies for which all six testers agreed with the reference standard, and there was generally wide variation in the level of accuracy across the studies.

Table 7. Accuracy of testing compared to reference standard.
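
As a simple illustration of how accuracy figures like those in Table 7 can be tabulated, the sketch below computes, for each study, the proportion of testers whose classification matched the reference standard. The study identifiers, tester names, and labels are hypothetical and are not the report’s data.

    # Minimal sketch: per-study accuracy against the reference standard,
    # i.e., the fraction of testers whose label matched the consensus label.
    # All identifiers and labels below are hypothetical.
    def accuracy_by_study(reference, tester_labels):
        """reference: study_id -> consensus design;
        tester_labels: tester -> {study_id -> assigned design}."""
        results = {}
        for study_id, true_design in reference.items():
            labels = [labels_by_study[study_id]
                      for labels_by_study in tester_labels.values()]
            results[study_id] = sum(l == true_design for l in labels) / len(labels)
        return results


    reference = {"study_01": "RCT", "study_02": "retrospective cohort"}
    testers = {
        "tester_A": {"study_01": "RCT", "study_02": "case-control"},
        "tester_B": {"study_01": "RCT", "study_02": "retrospective cohort"},
    }
    for study, acc in accuracy_by_study(reference, testers).items():
        print(f"{study}: {acc:.0%} of testers matched the reference standard")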

Test 2

Tester characteristics. Six staff members at the UAEPC were involved in the second round of testing. Three of the testers had been involved in the first round of testing, while three had no previous involvement with the project or the taxonomy being tested. One tester had a PhD in medicine, three testers had a master’s degree in epidemiology, and two testers had undergraduate degrees in health sciences or a related field and were undertaking graduate level training in epidemiology. The length of time the testers had worked with the UAEPC ranged from 2 months to 9 years. Four of the testers used a flow diagram with study design labels, while two used a flow diagram with letter codes.

Agreement. The time taken to classify the 15 studies ranged from 2.25 to 4 hours, with a mean of 2.75 hours overall and 11 minutes per study. There were three studies for which all six testers agreed on the classification. Five of six testers agreed on two studies, four agreed on six studies, three agreed on two studies, and two agreed on two studies. The overall level of agreement was considered moderate (κ=0.45) (Table 4). The degree of agreement for testers who had completed graduate level training was moderate (κ=0.45), while for testers undertaking graduate training it was fair (κ=0.39). The level of agreement was moderate both for those who used the flow diagram with study design labels (κ=0.41) and for those who used the version with letter codes (κ=0.55).

Accuracy of testers compared to reference standard. The accuracy of the testers was assessed against the reference standard (Table 7). There were three studies for which all six testers agreed with the reference standard, but generally there was wide variation in the level of accuracy across the studies. Table 8 presents the accuracy of the testers against the reference standard by study design. Accuracy improved for RCTs, nonrandomized trials, retrospective cohorts, interrupted time series (ITS) without comparison, and case-control studies. Accuracy decreased for controlled before-after studies, non-concurrent cohorts, and noncomparative studies. There was no difference for one before-after study. No comparisons could be made for ITS with comparison.

Table 8. Accuracy of testing compared to reference standard by study design.

We examined the classification of studies to identify patterns of disagreements (Table 9). The most common disagreements occurred at four key decision nodes in the flow diagram: whether the study was “experimental” (5/15 studies), whether there was a comparison (4/15 studies), whether the assessment of exposure and outcome was prospective or retrospective (3/15 studies), and whether the intervention/exposure and outcome data were gathered concurrently (2/15 studies).

Table 9. Classification of studies: Round 2.
