Gaynes BN, Gavin N, Meltzer-Brody S, et al. Perinatal Depression: Prevalence, Screening Accuracy, and Screening Outcomes. Rockville (MD): Agency for Healthcare Research and Quality (US); 2005 Feb. (Evidence Reports/Technology Assessments, No. 119.)

This publication is provided for historical reference only and the information may be out of date.

Perinatal Depression: Prevalence, Screening Accuracy, and Screening Outcomes.

In conducting this systematic review, we followed standardized procedures developed by the Agency for Healthcare Research and Quality (AHRQ) in collaboration with all its Evidence-based Practice Centers (EPCs) for such reviews. This chapter documents how we implemented those procedures to answer the three key questions on perinatal depression. We first discuss the role of the Technical Expert Advisory Group (TEAG). We then describe our inclusion/exclusion criteria, our strategy for identifying articles relevant for addressing the key questions, and our process for abstracting relevant information from the eligible articles and generating evidence tables. We also discuss our criteria for grading the quality of individual articles and the strength of the evidence as a whole. Finally, we explain the peer review process.

Role of the Technical Expert Advisory Group

Throughout the project, we enlisted the assistance of a TEAG to react to work in progress and advise us on substantive issues or possibly overlooked areas of research. The TEAG included four individuals with collective expertise in obstetrics, psychiatry, psychology, and research methods and both clinical and research experience in perinatal depression (see Appendix E, Acknowledgments). As in all such systematic reviews, the TEAG contributed to AHRQ's broader goals of (1) creating and maintaining science partnerships as well as public-private partnerships and (2) meeting the needs of an array of potential customers and users of its products. Thus, the TEAG was both an additional resource and a sounding board during the project.

To ensure robust, scientifically relevant work, we called on the TEAG to participate in conference calls and discussions through e-mail to

  • refine the analytic framework and key questions at the beginning of the project;
  • discuss the preliminary assessment of the literature, including inclusion/exclusion criteria;
  • identify relevant literature not revealed through our literature searches;
  • provide input on the information and categories included in evidence tables;
  • review proposed methods for data synthesis; and
  • help interpret preliminary findings.

Because of their extensive knowledge of this topic, we also asked TEAG members to participate in the external peer review of the draft report.

Literature Search Strategy

To ensure a comprehensive and reproducible literature search and appraisal, we identified relevant research studies using an explicit search strategy and uniformly applied a set of inclusion and exclusion criteria to the identified studies. We describe our criteria and approach in this section.

Inclusion and Exclusion Criteria

To identify relevant studies, we generated a list of inclusion and exclusion criteria for each key question. We made the criteria fairly restrictive to ensure that our conclusions would be based on the highest quality data available with the lowest risk of bias. Some criteria were common across the three key questions; others were specific to the question. Table 2 summarizes the criteria.

Table 2. Inclusion/exclusion criteria by key question.



For all key questions, studies had to report on original data, be in English, and be published from January 1980 through March 2004. This time frame ensured that the applied reference standards were consistent with the Diagnostic and Statistical Manual of Mental Disorders, Third Edition (DSM-III), or later criteria for the diagnosis of depression. The study could be conducted in any clinical setting or in the home but had to be from a developed country to increase the likelihood of being generalizable to the US population. In the criteria originally submitted in our research proposal, we proposed including only studies done in the United States, the United Kingdom and other Commonwealth/English-speaking countries, Europe, and Scandinavia. However, we determined after abstract review that such limitations would leave out a large number of relevant studies. Therefore, we modified our inclusion criteria to accept any study conducted in a developed country whose population could be generalized to pregnant and postpartum women in the United States, regardless of the language spoken in that country; the article itself still had to be published in English. We excluded studies published before 1980 or in a language other than English and those on women in less developed countries. We also excluded studies of women with major or minor depression in which the outcomes of interest were not distinguishable from those for women with bipolar disorder, primary psychotic disorders, or maternity blues.

In addition, studies for all key questions had to assess women for major depression either alone or together with minor depression during pregnancy or the first year postpartum by means of a clinical assessment or structured clinical interview. For Key Question (KQ) 1, we excluded studies of the prevalence and incidence of perinatal depression that relied solely on self-report screens to identify depression. For KQs 2 and 3, we excluded studies that included women with known depressive disorders at the outset. In KQ 2, study investigators used the clinical assessment or structured clinical interview as the criterion or gold standard with which to assess the properties of the screening instrument. In many KQ 3 studies, investigators used the clinical assessment to measure the depression outcomes from screening with subsequent intervention among women found to be at elevated risk of depression. Studies that measured women's mood using self-report measures only were also included in KQ 3.

For KQ 1, we included both prospective and retrospective studies of the prevalence and incidence of perinatal depression and studies that were conducted for purposes other than determining the prevalence and incidence of perinatal depression but nevertheless included a population-based estimate meeting the other inclusion criteria (e.g., studies of the properties of screening instruments). Furthermore, to answer the second part of KQ 1, we included both clinical trials and case-control studies comparing the incidence or prevalence of depression among pregnant women and newly delivered mothers to prevalence among women of similar age during other nonchildbearing periods of their lives. We included only prospective studies in those reviewed for KQs 2 and 3.
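The criteria above can be read as a simple eligibility filter. The following Python sketch is illustrative only: the `Study` record and its field names are hypothetical (the report defines no such schema), and it encodes only a subset of the criteria described in this section, not the full contents of Table 2.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Study:
    # Hypothetical fields for illustration; not taken from the report.
    original_data: bool
    language: str
    published: date                      # day set to 1 if only the month is known
    developed_country: bool
    clinical_reference_standard: bool    # clinical assessment or structured interview
    prospective: bool
    known_depression_at_outset: bool


# Publication window stated in the text: January 1980 through March 2004.
WINDOW_START = date(1980, 1, 1)
WINDOW_END = date(2004, 3, 31)


def meets_common_criteria(s: Study) -> bool:
    """Criteria shared by all three key questions."""
    return (
        s.original_data
        and s.language == "English"
        and WINDOW_START <= s.published <= WINDOW_END
        and s.developed_country
    )


def eligible_for(s: Study, kq: int) -> bool:
    """Simplified eligibility check for key question 1, 2, or 3."""
    if kq not in (1, 2, 3):
        raise ValueError("kq must be 1, 2, or 3")
    if not meets_common_criteria(s):
        return False
    if kq == 1:
        # Prospective or retrospective, but not self-report screens alone.
        return s.clinical_reference_standard
    # KQs 2 and 3: prospective only, no known depressive disorder at outset.
    if s.known_depression_at_outset or not s.prospective:
        return False
    if kq == 2:
        # The clinical assessment serves as the gold standard.
        return s.clinical_reference_standard
    # KQ 3 also admitted studies that used self-report mood measures only.
    return True
```

Under these assumptions, a prospective study that used only a self-report mood measure would be eligible for KQ 3 but not for KQ 1 or KQ 2.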

Literature Search and Retrieval Process

We used three strategies to identify studies providing evidence related to the key questions: systematic searches of electronic databases using both search terms and author names, hand searches of reference lists of included articles, and consultation with the TEAG. First, we generated a list of Medical Subject Heading (MeSH) search terms for each key question in the feasibility study. We used these terms to search standard electronic databases: MEDLINE, Cumulative Index to Nursing & Allied Health Literature (CINAHL), PsycINFO, Sociofile, and the Cochrane Library.

We conducted the electronic database searches twice. We initially did them in April 2003 for the feasibility study [24]. That study included three additional key questions, on natural history, risk factors, and treatment effectiveness for perinatal depression. We found relevant articles for the three key questions of the current study under the natural history and treatment effectiveness searches. We therefore conducted these and the incidence or prevalence and mass screening searches again in March 2004 to capture any studies published and posted in the interim.

The subject headings used and the total yield from each source are shown in Table 3 by key question. We found a total of 837 unduplicated citations in the electronic searches and picked up an additional 9 citations through the hand searches and discussion with the TEAG, for a total of 846 citations. We also searched the Cochrane Collaboration database for prior systematic reviews using the keywords “perinatal” and “depression.” This search yielded 38 reviews.

Table 3. Literature search strategies and yield.



Three senior reviewers with clinical expertise in perinatal depression reviewed the abstracts of articles identified during the literature search. Two clinicians evaluated each abstract against the inclusion criteria and resolved any differences in inclusion by consensus. In several instances, the abstracts did not provide enough information to make an inclusion decision; we pulled full articles to review for those studies. Of the 846 articles identified, 729 did not meet the inclusion criteria for any of the key questions and were therefore excluded, 8 studies were pulled for background only, and the remaining 109 articles were pulled for a full review.

Among the 109 studies pulled for full review, 50 did not meet our inclusion/exclusion criteria for any of the three key questions. The most common reason for exclusion was the absence of a gold standard (i.e., either a clinical assessment or structured clinical interview) for assessing depression, which eliminated 26 studies. Ten of the studies pulled for the evaluation of the properties of screening instruments were excluded because they did not report sensitivity and specificity or data from which these statistics could be computed. Other reasons for exclusion were depression assessed after the first year postpartum, no depression outcome measure, a retrospective study design, and restriction of the study sample to specific population subgroups (e.g., teens, patients of psychiatric hospitals). We based the last exclusion on two lines of reasoning. First, although groups such as adolescents are a key subgroup, our charge was to ensure that our results were generalizable to the broader US population. Second, these specific subpopulations are different enough from the remainder of the population that they warrant separate consideration. We excluded only one study because it was limited to an adolescent population.

We included the remaining 59 studies in our review, some of which met the inclusion criteria for more than one key question. We abstracted 30 studies for KQ 1, 23 for KQ 2, and 15 for KQ 3. We provide a graphical presentation of the disposition of the citations in Figure 2.
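The counts reported in this section can be checked with simple arithmetic. The short Python sketch below uses only the numbers given in the text; the function and key names are ours, chosen for illustration.

```python
def citation_disposition() -> dict:
    """Recompute the article disposition from the counts reported in the text."""
    electronic = 837                # unduplicated citations from database searches
    hand_and_teag = 9               # hand searches and TEAG discussion
    identified = electronic + hand_and_teag

    excluded_at_abstract = 729
    background_only = 8
    full_review = identified - excluded_at_abstract - background_only

    excluded_at_full_review = 50
    included = full_review - excluded_at_full_review

    abstracted = {"KQ 1": 30, "KQ 2": 23, "KQ 3": 15}
    total_abstractions = sum(abstracted.values())

    return {
        "identified": identified,
        "full_review": full_review,
        "included": included,
        "total_abstractions": total_abstractions,
        # Abstractions beyond one per included study, which arise because
        # some studies were abstracted for more than one key question.
        "extra_abstractions": total_abstractions - included,
    }


if __name__ == "__main__":
    for name, value in citation_disposition().items():
        print(f"{name}: {value}")
```

The per-question counts sum to 68, which exceeds the 59 included studies by 9; this is consistent with some studies having been abstracted for more than one key question.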

Figure 2. Perinatal depression article disposition.



Data Collection and Assessment

The data collection process involved abstracting relevant information from the eligible articles and generating evidence tables that present the key details of the study design and the major findings from the articles. A trained member of the study team read and abstracted each article; a second member checked the table entries for accuracy against the original article.

Appendix C contains the final evidence tables in their entirety. They provide the study design details and major findings. The dimensions of each study design abstracted vary by key question, but they contain some common elements, such as author, year of publication, study location (e.g., country, state), population description, and sample size. We also collected information on the clinical interview instrument and diagnostic criteria used to diagnose depression and the age and racial and ethnic distribution of study subjects in each study.

The study results are recorded in the form reported in the article. However, for assessing consistency of results across the studies and for combining study results in a meta-analysis (see below), we also transformed the study results when necessary into consistent outcome measures using the appropriate statistical formulas. These computed data elements are shown in bold in the evidence tables (Appendix C).
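As one illustration of such a transformation, a screening accuracy study that reports only the cross-classification of screen results against the clinical reference standard can be converted to sensitivity and specificity. The Python sketch below uses the standard definitions of these two statistics; this section does not list the formulas actually applied, and the counts in the usage example are hypothetical, not data from any included study.

```python
def sensitivity_specificity(tp: int, fn: int, fp: int, tn: int) -> tuple:
    """Return (sensitivity, specificity) from the cells of a 2x2 table.

    tp: screen positive, depressed by the reference standard
    fn: screen negative, depressed by the reference standard
    fp: screen positive, not depressed by the reference standard
    tn: screen negative, not depressed by the reference standard
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one depressed and one nondepressed woman")
    sensitivity = tp / (tp + fn)    # proportion of depressed women detected
    specificity = tn / (tn + fp)    # proportion of nondepressed women screened negative
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    sens, spec = sensitivity_specificity(tp=45, fn=5, fp=20, tn=180)
    print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```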

We conducted data abstraction electronically in a word processing program and in such a way that study identifiers and results were easily transferred from the forms to electronic files for input into programs for meta-analysis.


We conducted a meta-analysis of the different prevalence and incidence estimates from studies abstracted for KQ 1 to arrive at single prevalence and incidence estimates for particular periods and points in time. We elaborate on these methods in Chapter 3. We also conducted meta-analyses of the different estimates of the receiver operating characteristic (ROC) curves for screening instruments evaluated for KQ 2, as described in Chapter 4. Because of the diversity of screening instruments and prevention interventions in the studies found for KQ 3, we did not conduct a meta-analysis for this key question.
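To show in general terms what combining several prevalence estimates into one involves, the Python sketch below applies generic fixed-effect, inverse-variance pooling of proportions. This is an assumption made purely for illustration: the methods actually used are described in Chapter 3 and may differ. The function name and the input counts are hypothetical.

```python
import math


def pool_prevalence(studies):
    """Fixed-effect inverse-variance pooling of proportions.

    studies: list of (cases, sample_size) pairs.
    Returns (pooled_prevalence, standard_error).
    Illustrative only; not necessarily the method used in the report.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for cases, n in studies:
        p = cases / n
        if not 0.0 < p < 1.0:
            # The binomial variance is zero at p = 0 or 1, so the weight is undefined.
            raise ValueError("each study needs a prevalence strictly between 0 and 1")
        variance = p * (1.0 - p) / n    # binomial variance of a proportion
        weight = 1.0 / variance
        weighted_sum += weight * p
        weight_total += weight
    pooled = weighted_sum / weight_total
    standard_error = math.sqrt(1.0 / weight_total)
    return pooled, standard_error


if __name__ == "__main__":
    # Hypothetical (cases, sample size) pairs, not data from the included studies.
    pooled, se = pool_prevalence([(12, 150), (30, 400), (9, 90)])
    low, high = pooled - 1.96 * se, pooled + 1.96 * se
    print(f"pooled prevalence = {pooled:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Inverse-variance weighting gives larger, more precise studies more influence on the pooled estimate. A random-effects model would be a common alternative when estimates vary substantially across studies.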

Quality of Individual Articles

At the same time that we abstracted information on the study designs and findings in the included articles, we rated the quality of the studies. We developed a quality rating form for the screening accuracy (KQ 2) articles from criteria identified by the Cochrane Methods Working Group on Systematic Review of Screening and Diagnostic Tests [25]. For studies addressing KQ 1 and KQ 3, we modified the quality rating forms developed by Downs and Black for RCTs and observational studies [26]. These forms are provided in Appendix B.

The quality rating forms rated the reporting completeness and clarity, external validity, internal validity, and the power or precision of each study for the relevant key questions. Hence, the ratings refer to the usefulness or quality of the article for our purposes and not necessarily for the original purpose of the research or article. Studies that were included in more than one key question were rated separately for each key question. The specific quality items rated are described in more detail in Chapters 3, 4, and 5 for KQs 1, 2, and 3, respectively.

The senior abstractor completed the quality rating form for each article; another project team member then reviewed the completed form for accuracy and completeness. The overall quality scores of these articles are recorded in the evidence tables (Appendix C); scores on each of the domains are provided in Chapters 3, 4, and 5. All graded studies were included in the analysis regardless of their quality score. However, evidence from studies graded as poor was given less weight in the qualitative and quantitative syntheses and discussion.

Strength of Overall Evidence

In addition to the individual studies, we also rated the strength of the collective evidence on each key question. We applied four separate criteria: (1) number of studies, (2) aggregate sample sizes over the studies, (3) quality of the individual studies, and (4) representativeness of the study populations in the studies.

External Peer Review

As is customary for all evidence reports and systematic reviews done for AHRQ, the RTI-UNC EPC requested review of this report from a wide array of outside experts in the field and from relevant professional societies and public organizations. AHRQ also requested review from its own staff and appropriate federal agencies. We provide a list of the external peer reviewers in Appendix E. This report reflects substantive and editorial comments from this external peer review.


