The methods of this evidence synthesis are based on the methods outlined in the Agency for Healthcare Research and Quality (AHRQ) Methods Guide for Effectiveness and Comparative Effectiveness Reviews and the U.S. Preventive Services Task Force (USPSTF) Procedure Manual. The main sections in this chapter reflect the elements of the protocol established for the review. The methods and analyses were determined a priori, except where otherwise specified.

Topic Refinement and Technical Expert Panel

The National Institutes of Health Office of Medical Applications of Research (OMAR) commissioned this report, which AHRQ conducted through its Evidence-based Practice Center (EPC) Program. The Key Questions were developed by OMAR (Key Questions 3 to 5) and the USPSTF. OMAR will use the review to inform a consensus meeting and guideline development. The USPSTF joined this effort and will use the review to update its recommendation on screening for gestational diabetes mellitus.

Investigators from the University of Alberta EPC worked in consultation with representatives from AHRQ, OMAR and the USPSTF, and a panel of Technical Experts to operationalize the Key Questions. The Technical Expert Panel provided content and methodological expertise throughout the development of this evidence synthesis.

Literature Search Strategy

Our research librarian systematically searched the following bibliographic databases for studies published from 1995 to May 2012: MEDLINE® Ovid, Ovid MEDLINE® In-Process & Other Non-Indexed Citations, Cochrane Central Register of Controlled Trials (contains the Cochrane Pregnancy and Childbirth Group, which hand searches journals pertinent to its content area and adds relevant trials to the registry), Cochrane Database of Systematic Reviews (CDSR), Database of Abstracts of Reviews of Effects (DARE), Global Health, Embase, Pascal, CINAHL Plus with Full Text (EBSCOhost), BIOSIS Previews® (Web of Knowledge℠), Science Citation Index Expanded® and Conference Proceedings Citation Index-Science (both via Web of Science℠), PubMed®, LILACS (Latin American and Caribbean Health Science Literature), National Library of Medicine (NLM) Gateway, and OCLC ProceedingsFirst and PapersFirst. We searched trial registries, including the WHO International Clinical Trials Registry Platform (ICTRP) and Current Controlled Trials.

We limited the search to trials and cohort studies published in English. For the search strategies, the research librarian developed a combination of subject headings and keywords for each electronic resource (see Appendix A for the detailed search strategies). The search strategies were not peer reviewed.

We searched the Web sites of relevant professional associations and research groups, including the American Diabetes Association, International Association of the Diabetes in Pregnancy Study Groups, International Symposium on Diabetes in Pregnancy, and Australasian Diabetes in Pregnancy Society for conference abstracts and proceedings from the past 3 years. We reviewed the reference lists of relevant reviews (including the 2008 USPSTF review) and included studies to identify additional studies.

We used Reference Manager® for Windows version 11.0 (2004–2005 Thomson ResearchSoft) bibliographic database to manage the results of our literature searches.

Inclusion and Exclusion Criteria

The research team developed the review eligibility criteria in consultation with the technical expert panel. The inclusion and exclusion criteria are presented in Table 2. We included studies only when less than 20 percent of enrolled women had a known history of pre-existing diabetes or separate data were provided for women with no pre-existing diabetes.

Table 2. Eligibility criteria for the review.

We limited our eligibility criteria to studies published in English due to lack of translation resources. This decision was made in consultation with the technical expert panel, which expressed no concerns that limiting the search to English language would forfeit important studies. We included studies that were published since 1995 in order to capture several key studies that were published in the late 1990s.

Randomized controlled trials (RCTs), nonrandomized controlled trials (NRCTs), and prospective and retrospective cohort studies were eligible for inclusion.

Study Selection

We assessed the eligibility of articles in two phases. In the first phase, two reviewers used broad criteria to independently screen the titles, keywords, and abstracts (when available) (Appendix B1). They rated each article as “include,” “exclude,” or “unclear.” We retrieved the full-text article for any study that was classified as “include” or “unclear” by at least one reviewer. In the second phase, two reviewers independently assessed each full-text article using a detailed form (Appendix B2). We resolved disagreements by discussion and consensus or third-party adjudication.

Quality Assessment of Individual Studies

Two reviewers independently assessed the methodological quality of the studies and resolved discrepancies by discussion and consensus. We tested each quality assessment tool on a sample of studies and developed guidelines for assessing the remaining studies. In addition, we extracted the source of funding for each study. For studies included in Key Questions 2 to 5, we summarized the quality as “good,” “fair,” or “poor” based on assessments from the tools described below.

Quality Assessment of Diagnostic Studies

We assessed the methodological quality of studies relevant to Key Question 1 using the quality assessment of diagnostic accuracy studies (QUADAS)-2 checklist.55 The tool consists of 14 items addressing important common biases in diagnostic studies such as spectrum, incorporation, verification, disease progression, and information biases. Individual items are rated “yes,” “no,” or “unclear” (Appendix B3a).

Quality Assessment of Trials

We assessed the internal validity of RCTs and NRCTs using the Cochrane Collaboration Risk of Bias tool (Appendix B3b). This tool consists of seven domains of potential bias (sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessment, incomplete outcome data, selective outcome reporting, and “other” sources of bias) and a categorization of the overall risk of bias.

Each domain was rated as having “low,” “unclear,” or “high” risk of bias. We assessed the blinding and incomplete outcome data items separately for subjective outcomes (e.g., depression scale) and objective clinical outcomes (e.g., mortality). We reported any additional sources of bias, such as baseline imbalances or design-specific risks of bias, in the “other” sources of bias domain.

The overall risk of bias assessment was based on the responses to individual domains. If one or more of the individual domains had a high risk of bias, we rated the overall score as high risk of bias. We rated the overall risk of bias as low only if all components were assessed as having a low risk of bias. The overall risk of bias was unclear in all other situations.
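
As an illustration only, the overall rating rule described above can be sketched in Python. The function name and the string labels are hypothetical and are not part of the review's tooling; the logic simply mirrors the three sentences above.

```python
from typing import Iterable


def overall_risk_of_bias(domain_ratings: Iterable[str]) -> str:
    """Combine per-domain ratings into one overall rating.

    Rule sketched from the text: any "high" domain gives "high";
    all domains "low" gives "low"; anything else gives "unclear".
    """
    ratings = [r.strip().lower() for r in domain_ratings]
    allowed = {"low", "unclear", "high"}
    if not ratings or any(r not in allowed for r in ratings):
        raise ValueError("each domain must be rated low, unclear, or high")
    if "high" in ratings:
        return "high"
    if all(r == "low" for r in ratings):
        return "low"
    return "unclear"


# Seven domains, one of them unclear and none high: overall is "unclear".
print(overall_risk_of_bias(["low"] * 6 + ["unclear"]))
```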

Quality Assessment of Cohort Studies

We used the Newcastle-Ottawa Quality Assessment Scale (Appendix B3c) to assess the methodological quality of prospective and retrospective cohort studies. The scale comprises eight items that evaluate three domains of quality: sample selection, comparability of cohorts, and assessment of outcomes. Each item that is adequately addressed is awarded one star, except for the “comparability of cohorts” item, for which a maximum of two stars can be given.

The overall score is calculated by tallying the stars. We considered a total score of 7 to 9 stars to indicate high quality, 4 to 6 stars to indicate moderate quality, and 3 or fewer stars to indicate poor quality.
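
The star thresholds can be expressed as a small lookup. This is a minimal sketch, assuming the moderate band covers every score from 4 to 6 stars and that the maximum is 9 stars (eight items, with up to two stars for comparability); the function name is hypothetical.

```python
def nos_quality(stars: int) -> str:
    """Map a Newcastle-Ottawa star total to a quality category."""
    if not 0 <= stars <= 9:
        raise ValueError("star total must be between 0 and 9")
    if stars >= 7:
        return "high"
    if stars >= 4:
        return "moderate"
    return "poor"


# Print the category for every possible star total.
for total in range(10):
    print(total, nos_quality(total))
```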

Data Extraction

We extracted data using a structured, electronic form and imported the data into a Microsoft Excel™ 2007 spreadsheet (Microsoft Corp., Redmond, WA) (Appendix B4). One reviewer extracted data, and a second reviewer checked the data for accuracy and completeness. Reviewers resolved discrepancies by discussion and consensus or in consultation with a third party. We extracted the following data: author identification, year of publication, source of funding, study design, population (e.g., inclusion and exclusion criteria, number of patients enrolled, study withdrawals, duration of followup), patient baseline characteristics (e.g., age, race, ethnicity, weight, body mass index, previous diagnosis of gestational diabetes mellitus (GDM), family history of diabetes, comorbidities, smoking prevalence), details of the screening or diagnostic test and reference standard, glucose threshold for GDM, type of treatment, and outcomes, including adverse events.

We reported outcomes only if quantitative data were reported or could be derived from graphs. We did not include outcomes that were described only qualitatively (e.g., if study authors reported that “there was no difference between the groups”) or for which only a p-value was reported.

We planned to extract any cost-related data, including costs to patients, insurance, or health care system, that were reported in the included studies. However, we did not search for cost effectiveness studies or conduct cost-effectiveness analyses of different treatment strategies. Studies that reported only costs and provided no other outcome data were not included in the review.

When more than one publication reported the results of a single study, we considered the earliest published report of the main outcome data to be the primary publication. We extracted data from the primary publication first and then any additional outcome data reported in the secondary publications.

Data Synthesis

We made the following assumptions and performed the following imputations to transform reported data into the form required for analysis. We extracted data from graphs using the measurement tool of Adobe Acrobat 9 Pro (Adobe Systems Inc., California, U.S.) when data were not reported in text or tables. As necessary, we approximated means by medians and used 95% confidence intervals (CI), p-values, or inter-quartile ranges to calculate or approximate standard deviations when they were not given. We calculated p-values when they were not reported.56
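
The text does not state which formulas were used for these imputations. The sketch below shows two commonly used large-sample approximations, offered as an assumption rather than as the review's actual procedure: a standard deviation recovered from the 95% CI of a single-group mean, and a standard deviation approximated from an inter-quartile range under approximate normality. Function names are hypothetical.

```python
import math


def sd_from_ci(lower: float, upper: float, n: int, z: float = 1.96) -> float:
    """Approximate SD from a 95% CI around a single-group mean.

    Assumes the CI was built as mean +/- z * SD / sqrt(n), which is
    reasonable for large samples only.
    """
    if n <= 0 or upper < lower:
        raise ValueError("need n > 0 and upper >= lower")
    return math.sqrt(n) * (upper - lower) / (2 * z)


def sd_from_iqr(q1: float, q3: float) -> float:
    """Approximate SD from an inter-quartile range.

    Assumes roughly normal data, for which IQR is about 1.35 * SD.
    """
    if q3 < q1:
        raise ValueError("q3 must be >= q1")
    return (q3 - q1) / 1.35


print(sd_from_ci(8.04, 11.96, n=100))
print(sd_from_iqr(10.0, 23.5))
```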

For Key Question 1, we constructed 2×2 tables and calculated sensitivity, specificity, positive and negative predictive values, accuracy (true positive plus true negative divided by the sum of true positive, true negative, false positive, and false negative) and yield (i.e., prevalence) of the screening or diagnostic tests. If studies were clinically homogeneous, we pooled sensitivities and specificities using a hierarchical summary receiver operating characteristic curve and bivariate analysis of sensitivity and specificity.57
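
The per-study measures listed above follow directly from the four cells of a 2×2 table. The sketch below uses the standard definitions; the class and function names are hypothetical, and the counts in the usage example are invented for illustration, not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    tp: int  # index test positive, reference standard positive
    fp: int  # index test positive, reference standard negative
    fn: int  # index test negative, reference standard positive
    tn: int  # index test negative, reference standard negative


def accuracy_measures(t: TwoByTwo) -> dict:
    """Return the diagnostic measures named in the text.

    No guard is included for empty rows or columns; a zero
    denominator raises ZeroDivisionError.
    """
    total = t.tp + t.fp + t.fn + t.tn
    return {
        "sensitivity": t.tp / (t.tp + t.fn),
        "specificity": t.tn / (t.tn + t.fp),
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        "accuracy": (t.tp + t.tn) / total,
        "prevalence": (t.tp + t.fn) / total,
    }


# Illustrative counts only.
for name, value in accuracy_measures(TwoByTwo(tp=40, fp=10, fn=10, tn=140)).items():
    print(f"{name}: {value:.3f}")
```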

We described the results of studies qualitatively and in evidence tables. For Key Questions 3 to 5, we performed meta-analysis to synthesize the available data when studies were sufficiently similar in terms of their study design, population, screening or diagnostic test, and outcomes. This was done using the Mantel-Haenszel method for relative risks and the inverse variance method for pooling mean differences. Due to the expected between-study differences, we decided a priori to combine results using the random effects model.58
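
To make the pooling step concrete, the following is a minimal sketch of inverse-variance pooling under a random effects model. It assumes the DerSimonian-Laird estimator for the between-study variance, which the text does not name explicitly, and it covers only the inverse-variance case (for example, mean differences), not the Mantel-Haenszel weighting used for relative risks. The function name and input values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def random_effects_pool(
    effects: Sequence[float], variances: Sequence[float], z: float = 1.96
) -> Tuple[float, Tuple[float, float], float]:
    """Pool study effects; return (estimate, 95% CI, tau squared)."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    # Fixed-effect (inverse-variance) weights and estimate.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q and the DerSimonian-Laird between-study variance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add tau squared to each study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, (pooled - z * se, pooled + z * se), tau2


# Illustrative mean differences and their variances.
print(random_effects_pool([0.0, 2.0], [1.0, 1.0]))
```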

We measured statistical heterogeneity among studies using the I² statistic. We considered an I² value of 75 percent or greater to represent substantial heterogeneity and did not pool studies indicating substantial heterogeneity. When studies were not pooled due to substantial heterogeneity, we performed subgroup analyses if the number of studies was sufficient to warrant these analyses.59 Factors to be considered for subgroup analyses included glucose thresholds for tests, type of treatment, maternal age, race or ethnicity, weight or body mass index, previous diagnosis of GDM, family history of diabetes, and comorbidities, which were extracted from each study.
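
The heterogeneity check can be sketched as follows, using the usual definition of the I² statistic from Cochran's Q and its degrees of freedom, together with the 75 percent threshold stated above. Function names are hypothetical, and the inputs in the usage line are illustrative.

```python
from typing import Sequence


def cochran_q(effects: Sequence[float], variances: Sequence[float]) -> float:
    """Cochran's Q from study effects and their variances."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))


def i_squared(q: float, df: int) -> float:
    """I squared as a percentage: max(0, (Q - df) / Q) * 100."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def should_pool(i2_percent: float, threshold: float = 75.0) -> bool:
    """Pool only when I squared is below the stated threshold."""
    return i2_percent < threshold


q = cochran_q([0.0, 2.0], [1.0, 1.0])
i2 = i_squared(q, df=1)
print(q, i2, should_pool(i2))
```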

We used Review Manager Version 5.0 (The Cochrane Collaboration, Copenhagen, Denmark) to perform meta-analyses. For dichotomous outcomes, we computed relative risks to estimate between-group differences. If no event was reported in one treatment arm, a correction factor of 0.5 was added to each cell of the 2×2 table in order to obtain estimates of the relative risk. For continuous variables, we calculated mean differences for individual studies. We reported all results with 95% CI.
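
The relative risk calculation with the 0.5 correction can be illustrated with a short sketch. This is not Review Manager's implementation; it is a hand-written approximation using the standard log relative risk standard error, with a hypothetical function name and invented counts. Studies with zero events in both arms are not treated specially here.

```python
import math
from typing import Tuple


def relative_risk(
    events_t: int, n_t: int, events_c: int, n_c: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (RR, lower 95% limit, upper 95% limit)."""
    a, b = float(events_t), float(n_t - events_t)
    c, d = float(events_c), float(n_c - events_c)
    # Add 0.5 to every cell when either arm has no events.
    if a == 0 or c == 0:
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    rr = (a / (a + b)) / (c / (c + d))
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Illustrative counts: 0/50 events with treatment vs 5/50 with control.
print(relative_risk(0, 50, 5, 50))
```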

Where possible, we assessed publication bias both visually using the funnel plot and quantitatively using Begg's60 and Egger's61 tests. Review Manager version 5.0.22 (The Cochrane Collaboration, Copenhagen, Denmark) and Stata version 7.0 (Stata Corp., College Station, TX) were used for all these analyses. In the event that studies could not be pooled, a narrative summary of the results was presented.

Strength of the Body of Evidence

Two independent reviewers graded the strength of evidence for major outcomes and comparisons for Key Questions 3 and 4 using the EPC GRADE (Grading of Recommendations Assessment, Development, and Evaluation) approach. We resolved discrepancies by discussion and consensus. We graded the evidence for the following key outcomes: birth injury, preeclampsia, neonatal hypoglycemia, maternal weight gain, and long-term metabolic outcomes of the child and mother. We made a post hoc decision to grade shoulder dystocia and macrosomia. These were not included in the protocol as outcomes that would be graded but were felt by the clinical investigators to be important to grade.

For each outcome, we assessed four major domains: risk of bias (rated as low, moderate, or high), consistency (rated as consistent, inconsistent, or unknown), directness (rated as direct or indirect), and precision (rated as precise or imprecise). No additional domains were used.

Based on the individual domains, we assigned the following overall evidence grades for each outcome for each comparison of interest: high, moderate, or low confidence that the evidence reflects the true effect. When no studies were available for an outcome or the evidence did not permit estimation of an effect, we rated the strength of evidence as insufficient.

To determine the overall strength of evidence score, we first considered the risk of bias domain. RCTs with a low risk of bias were initially considered to have a “high” strength of evidence, whereas RCTs with high risk of bias and well-conducted cohort studies received an initial grade of “moderate” strength of evidence. Low quality cohort studies received an initial grade of “low” strength of evidence. The strength of evidence was then upgraded or downgraded depending on the assessments of that body of evidence on the consistency, directness, and precision domains.


We assessed the applicability of the body of evidence following the PICOTS (population, intervention, comparator, outcomes, timing of outcome measurement, and setting) format used to assess study characteristics. Factors that may potentially weaken the applicability of studies may include study population factors (e.g., race or ethnicity, age, risk level of GDM [i.e., weight, body mass index, previous GDM diagnosis, family history of diabetes], comorbidities), study design (i.e., highly controlled studies [e.g., RCTs] vs. observational studies), setting (e.g., primary vs. tertiary care), and experience of care providers.

Peer Review and Public Commentary

Peer reviewers were invited to provide written comments on the draft report based on their clinical, content, or methodologic expertise. Peer review comments on the draft report were addressed by the EPC in preparation of the final draft of the report. Peer reviewers did not participate in writing or editing of the final report or other products. The synthesis of the scientific literature presented in the final report does not necessarily represent the views of individual reviewers. The dispositions of the peer review comments are documented and will be published 3 months after the publication of the Evidence Report.

Potential reviewers must disclose any financial conflicts of interest greater than $10,000 and any other relevant business or professional conflicts of interest. Invited peer reviewers may not have any financial conflict of interest greater than $10,000. Peer reviewers who disclose potential business or professional conflicts of interest may submit comments on draft reports through AHRQ's public comment mechanism.

The draft report was posted for public commentary. Comments on the draft report were considered by the EPC in preparing the final report.