Figure 1. Causal pathway for the screening and treatment of perinatal depression
The Agency for Healthcare Research and Quality (AHRQ), through its Evidence-Based Practice Centers (EPCs), sponsors the development of evidence reports and technology assessments to assist public- and private-sector organizations in their efforts to improve the quality of health care in the United States. The reports and assessments provide organizations with comprehensive, science-based information on common, costly medical conditions and new health care technologies. The EPCs systematically review the relevant scientific literature on topics assigned to them by AHRQ and conduct additional analyses when appropriate prior to developing their reports and assessments.
This report on perinatal depression was requested and funded by the Safe Motherhood Group (SMG). The SMG consists of representatives from several agencies within the U.S. Department of Health and Human Services (DHHS): the DHHS Office on Women's Health; Centers for Disease Control and Prevention; Health Resources and Services Administration; Maternal and Child Health Bureau; National Institutes of Health, National Institute of Mental Health, National Institute of Child Health and Human Development, National Institute on Drug Abuse, and the Office of Research on Women's Health; Food and Drug Administration; Substance Abuse and Mental Health Services Administration; and Agency for Healthcare Research and Quality.
To bring the broadest range of experts into the development of evidence reports and health technology assessments, AHRQ encourages the EPCs to form partnerships and enter into collaborations with other medical and research organizations. The EPCs work with these partner organizations to ensure that the evidence reports and technology assessments they produce will become building blocks for health care quality improvement projects throughout the Nation. The reports undergo peer review prior to their release.
AHRQ expects that the EPC evidence reports and technology assessments will inform individual health plans, providers, and purchasers as well as the health care system as a whole by providing important information to help improve health care quality.
We welcome comments on this evidence report. They may be sent by mail to the Task Order Officer named below at: Agency for Healthcare Research and Quality, 540 Gaither Road, Rockville, MD 20850, or by e-mail to epc@ahrq.gov.
Carolyn M. Clancy, M.D.
Director
Agency for Healthcare Research and Quality
Jean Slutsky, P.A., M.S.P.H.
Director, Center for Outcomes and Evidence
Agency for Healthcare Research and Quality
Kenneth S. Fink, M.D., M.G.A., M.P.H.
Director, EPC Program
Agency for Healthcare Research and Quality
Marian D. James, M.A., Ph.D.
EPC Program Task Order Officer
Agency for Healthcare Research and Quality
The authors of this report are responsible for its content. Statements in the report should not be construed as endorsement by the Agency for Healthcare Research and Quality or the U.S. Department of Health and Human Services of a particular drug, device, test, treatment, or other clinical service.
Context. Depression during pregnancy or the first year postpartum is impressively common and can have devastating consequences for the woman, her children, and other family members.
Objectives. We systematically review the evidence on (1) the prevalence and incidence of perinatal depression, (2) the accuracy of different screening instruments, and (3) the effectiveness of interventions for women screened as high risk for perinatal depression.
Data Sources. MEDLINE, CINAHL, PsycINFO, Sociofile, and the Cochrane Library (1980 through March 2004); bibliographic hand searches; and experts.
Study Selection. The English-language studies assessed women for major depression alone or for major or minor depression. Studies of the prevalence and incidence of depression and the accuracy of screening tools had to include diagnostic confirmation by a reference standard. Studies involving interventions required a comparison group. Two reviewers independently evaluated each abstract to determine inclusion by consensus.
Data Extraction. A primary reviewer abstracted data on key variables from the articles directly into detailed evidence tables; a second reviewer confirmed accuracy.
Data Synthesis. We conducted a meta-analysis of the prevalence and incidence estimates to compute combined estimates for particular periods and points in time. We also conducted meta-analyses of the sensitivity and specificity of different screening instruments. For screening outcome studies, we were able to synthesize the results only qualitatively.
Results. We identified 30 studies of prevalence. For major depression alone, point prevalence estimates ranged from 3.1 percent to 4.9 percent at different times during pregnancy and 1.0 percent to 5.9 percent at different times during the first postpartum year. For major and minor depression, estimates of the point prevalence ranged from 8.5 percent to 11.0 percent during pregnancy and 6.5 percent to 12.9 percent during the first year postpartum. However, these prevalence estimates were not significantly different from those of similarly aged nonchildbearing women. Data on incidence were more limited.
We identified 10 studies of screening accuracy. One small study reported on accuracy during pregnancy. For postpartum depression, screeners appeared feasible, but the small number of depressed patients involved precluded identifying an optimal screener or threshold for screening. The screening instruments studied were generally good at identifying major depression alone, with accuracy consistent with reports from primary care settings, but they performed more poorly for the major or minor depression category.
We found no studies directly testing whether screening improved outcomes. However, we identified 15 studies that used some sort of screening to identify women at risk of depression and for whom a subsequent intervention was provided. The results of four small studies of various psychosocial interventions during pregnancy did not demonstrate consistently superior outcomes. Results were also mixed for postpartum interventions. Six of nine studies of various psychosocial interventions reported significant improvement in depression for the experimental group. Two studies with pharmacologic interventions provided conflicting results.
Conclusions. Although limited, the available research suggests that depression is one of the most common perinatal complications and that fairly accurate and feasible screening measures are available. Studies with larger sample sizes and a greater racial and ethnic mix are needed. Researchers also need to determine whether screening itself leads to better access to proven treatment and improved outcome relative to usual care.
Depressive disorders are ubiquitous and remarkably impairing; they occur throughout the lifespan. Lifetime prevalence rates of depression from community-based surveys range from 4.9 percent to 17.1 percent.1–3 Gender plays an important role in the prevalence rates of depression; women report a history of major depression at nearly twice the rate of men.4 In particular, women of childbearing age are at high risk for major depression.2, 3, 5 Pregnancy and new motherhood may increase the risk of depressive episodes.
Depression is the leading cause of disease-related disability among women in the world.6 It can have devastating consequences, not only for the women experiencing it but also for the women's children and family.7–9 For example, Stein and colleagues found that the mother-child interactions of depressed mothers and their children were of lower quality than those of nondepressed mothers,10 and Flynn et al. found that maternal depression was related to both missed pediatric appointments and greater use of emergency department services.11 A review of other research in this area points out that parental depression has been linked to raised levels of psychiatric disturbances among children and to greater child insecurity in attachment relationships.7, 8
The importance of detecting and treating perinatal depression has only recently been recognized. Perinatal depression encompasses major and minor depressive episodes that occur either during pregnancy or within the first 12 months following delivery. Major depression is a distinct clinical syndrome for which treatment is clearly indicated,12 whereas the definition and management of minor depression are less clear. Minor depression is an impairing yet less severe constellation of depressive symptoms13 for which controlled trials have not consistently indicated whether particular interventions are more effective than placebo.14, 15 In this report, we address major depressive episodes alone, which we refer to as major depression, as well as a broader grouping of major or minor depression, which we refer to as such or by the more general terms “depression” or “depressive illness.” We necessarily rely on the specific definitions of minor depression used by the different authors of the reviewed studies.
Another mental disorder that can occur in the perinatal period is postpartum psychosis. Unlike postpartum depression, postpartum psychosis is a relatively rare event with an estimated incidence of 1.1 to 4.0 cases per 1,000 deliveries.16 The onset of postpartum psychosis is usually acute, within the first 2 weeks of delivery, and appears to be more common in women with a strong family history of bipolar or schizoaffective disorder.17 Postpartum psychosis is an important disorder in its own right, but it is not addressed specifically in this report.
Perinatal depression, major or minor, often goes unrecognized because many of the discomforts of pregnancy and the puerperium are similar to symptoms of depression.18, 19 The onset of major depression is believed to be impressively common in the postpartum period; researchers have found a 3-fold increase in the onset of major or minor depression in the first 5 weeks postpartum compared to women of similar age, marital status, and parity at nonchildbearing times.20 However, the precise levels of the prevalence and incidence of perinatal depression are uncertain. Published estimates of the rate of major or minor depression in the postpartum period range widely—from 5 percent to more than 25 percent of new mothers—depending on the assessment method, the timing of the assessment, and population characteristics.21–23
Although many screening instruments have been developed or modified to detect major or minor depression in pregnant and newly delivered women, the evidence on their screening accuracy relative to a reference standard has yet to be systematically reviewed and assessed.24 Evidence on the effectiveness of screening all pregnant women and providing a preventive intervention to those scoring at high risk has also not been systematically investigated and evaluated.24
| Key Question | |
|---|---|
| 1 | What is the incidence and prevalence of depression (major or minor) during pregnancy and during the postpartum period? Is it increased during pregnancy and the postpartum period compared to nonchildbearing periods? |
| 2 | What is the accuracy of different screening tools for detecting depression during pregnancy and the postpartum period? |
| 3 | Does prenatal or early postnatal screening for depressive symptoms with subsequent intervention lead to improved outcomes? |
We show a simple schematic of the causal pathway for the screening and treatment of perinatal depression and the links addressed by the three study questions in Figure 1.
The second key question addresses the accuracy of different screening instruments for postpartum depression—that is, how well different instruments detect pregnant or postpartum women who have depression (sensitivity) and pregnant and postpartum women who do not have depression (specificity). We identify and abstract English-language and non-English-language studies of various cutoff scores for a variety of commonly used instruments but review only the English-language studies.
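As a minimal sketch of the two accuracy measures defined above, the following Python computes sensitivity and specificity from a 2x2 table that compares a screening result with the reference standard. The function name and the counts are hypothetical illustrations, not data from any reviewed study.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from 2x2 counts.

    tp: depressed by the reference standard, screen positive
    fn: depressed by the reference standard, screen negative
    tn: not depressed by the reference standard, screen negative
    fp: not depressed by the reference standard, screen positive
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one depressed and one nondepressed woman")
    sensitivity = tp / (tp + fn)  # share of depressed women the screen detects
    specificity = tn / (tn + fp)  # share of nondepressed women the screen clears
    return sensitivity, specificity


# Hypothetical counts, for illustration only.
sens, spec = sensitivity_specificity(tp=18, fn=2, tn=160, fp=20)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising or lowering an instrument's cutoff score changes all four counts at once, which is why each cutoff yields its own sensitivity and specificity pair.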
Finally, we review studies that provide evidence on whether interventions can reduce the prevalence and incidence of perinatal depression for women who are screened and found to be at high risk for the disorder. We also summarize evidence in these studies on the effect of screening with subsequent intervention on other health outcomes for the woman and her infant. This third question addresses whether the screening process itself ultimately leads to improved outcomes for perinatal depression. Studies had to use some form of screening to identify women for testing interventions involving a technique to address psychological status in the woman and had to have an outcome measured related to depression severity.
In this report, we provide the results of our systematic search and review of the published literature for evidence addressing these questions. In conducting this study, our intent was to answer the questions using the most reliable evidence available, obtain a sense of the strength of the available evidence, and identify gaps in the knowledge base that require further research. We follow a discussion of our general approach and methods in Chapter 2 with discussions of each of the question-specific methods and findings (Chapters 3, 4, and 5). In Chapter 6, we discuss our main conclusions, comment on the state of the evidence, and offer an agenda for future research studies. Appendix A presents the exact search strings for the electronic database searches. Appendix B contains copies of our quality rating forms. Appendix C presents the evidence tables, Appendix D provides a list of excluded articles, and Appendix E provides acknowledgments.
In conducting this systematic review, we followed standardized procedures developed by the Agency for Healthcare Research and Quality (AHRQ) in collaboration with all its Evidence-based Practice Centers (EPCs) for such reviews. This chapter documents how we implemented those procedures to answer the three key questions on perinatal depression. We first discuss the role of the Technical Expert Advisory Group (TEAG). We then describe our inclusion/exclusion criteria, our strategy for identifying articles relevant for addressing the key questions, and our process for abstracting relevant information from the eligible articles and generating evidence tables. We also discuss our criteria for grading the quality of individual articles and the strength of the evidence as a whole. Finally, we explain the peer review process.
Throughout the project, we enlisted the assistance of a TEAG to react to work in progress and advise us on substantive issues or possibly overlooked areas of research. The TEAG included four individuals with collective expertise in obstetrics, psychiatry, psychology, and research methods and both clinical and research experience in perinatal depression (see Appendix E, Acknowledgments). As in all such systematic reviews, the TEAG contributed to AHRQ's broader goals of (1) creating and maintaining science partnerships as well as public-private partnerships and (2) meeting the needs of an array of potential customers and users of its products. Thus, the TEAG was both an additional resource and a sounding board during the project.
To ensure robust, scientifically relevant work, we called on the TEAG to participate in conference calls and discussions through e-mail to
refine the analytic framework and key questions at the beginning of the project;
discuss the preliminary assessment of the literature, including inclusion/exclusion criteria;
identify relevant literature not revealed through our literature searches;
provide input on the information and categories included in evidence tables;
review proposed methods for data synthesis; and
help interpret preliminary findings.
Because of their extensive knowledge of this topic, we also asked TEAG members to participate in the external peer review of the draft report.
To ensure a comprehensive and reproducible literature search and appraisal, we identified relevant research studies using an explicit search strategy and uniformly applied a set of inclusion and exclusion criteria to the identified studies. We describe our criteria and approach in this section.
| Category | Inclusion | Exclusion |
|---|---|---|
| All Key Questions | | |
| Publication date | 1980 through March 2004 | |
| Setting | Developed countries only | Less-developed countries |
| | Any clinical setting or homes | |
| Populations | Humans only | Animal studies |
| | Depressive illness assessed during pregnancy or first postpartum year | Trials addressing exclusively bipolar disorder, a primary psychotic disorder, or maternity blues |
| Study design | Original data | Case reports, case series, letters, editorials, and non-systematic reviews that have no original data |
| Prevalence and Incidence (Key Question 1) | | |
| Study design | Prevalence or incidence study | |
| | Epidemiologic cohort or weighted to be representative | |
| Study population | Diagnosis of major depressive episode or postpartum depressive episode using criterion standard (see text) | Depressive disorder identified only by screen |
| Screening Accuracy (Key Question 2) | | |
| Study design | Must have criterion standard (see text) | Case-control studies |
| | Studies must be prospective | |
| Outcomes of interest | Sensitivity and specificity | |
| Study population | Patients who are screened for depression during pregnancy or during 12 months postpartum | Patients with known current depressive episode |
| Screening Interventions Criteria (Key Question 3) | | |
| Study design | Randomized controlled trial or prospective cohort study | Case-control studies |
| Outcomes of interest | Clinical status and functioning | |
| Study population | Patients identified by a screen during pregnancy or during 12 months postpartum as being at high risk of having depression | Patients with known current depressive episode |
For all key questions, studies had to report on original data, be in English, and be published from January 1980 through March 2004. This time frame ensured that the applied reference standards were consistent with the Diagnostic and Statistical Manual of Mental Disorders, Third Edition (DSM-III), or later criteria for the diagnosis of depression. The study could be conducted in any clinical setting or home but had to be from a developed country to increase the likelihood of being generalizable to the US population. In our original criteria submitted in the research proposal, we proposed including only studies done in the United States, the United Kingdom and other Commonwealth/English-speaking countries, Europe, and Scandinavia. However, we determined after abstract review that such limitations would leave out a large number of relevant studies. Therefore, we modified our inclusion criteria to accept any study conducted in developed countries where the population could be generalized to pregnant and postpartum women in the United States, regardless of the language spoken. We excluded studies published before 1980 or in a language other than English and those on women in less developed countries. We also excluded studies of women with major or minor depression in which the outcomes of interest were not distinguishable from those for women with bipolar disorder, primary psychotic disorders, or maternity blues.
In addition, studies for all key questions had to assess women for major depression either alone or together with minor depression during pregnancy or the first year postpartum by means of a clinical assessment or structured clinical interview. For Key Question (KQ) 1, we excluded studies of the prevalence and incidence of perinatal depression that relied solely on self-report screens to identify depression. For KQs 2 and 3, we excluded studies that included women with known depressive disorders at the outset. In KQ 2, study investigators used the clinical assessment or structured clinical interview as the criterion or gold standard with which to assess the properties of the screening instrument. In many KQ 3 studies, investigators used the clinical assessment to measure the depression outcomes from screening with subsequent intervention among women found to be at elevated risk of depression. Studies that measured women's mood using self-report measures only were also included in KQ 3.
For KQ 1, we included both prospective and retrospective studies of the prevalence and incidence of perinatal depression and studies that were conducted for purposes other than determining the prevalence and incidence of perinatal depression but nevertheless included a population-based estimate meeting the other inclusion criteria (e.g., studies of the properties of screening instruments). Furthermore, to answer the second part of KQ 1, we included both clinical trials and case-control studies comparing the incidence or prevalence of depression among pregnant women and newly delivered mothers to prevalence among women of similar age during other nonchildbearing periods of their lives. We included only prospective studies in those reviewed for KQs 2 and 3.
We used three strategies to identify studies providing evidence related to the key questions: systematic searches of electronic databases using both search terms and author names, hand searches of reference lists of included articles, and consultation with the TEAG. First, we generated a list of Medical Subject Heading (MeSH) search terms for each key question in the feasibility study. We used these terms to search standard electronic databases: MEDLINE, Cumulative Index to Nursing & Allied Health Literature (CINAHL), PsycINFO, Sociofile, and the Cochrane Library.
We conducted the electronic database searches twice. We initially did them in April 2003 for the feasibility study.24 That study included three additional key questions, including questions on natural history, risk factors, and treatment effectiveness for perinatal depression. We found relevant articles for the three key questions of the current study under the natural history and treatment effectiveness searches. We therefore conducted these and the incidence or prevalence and mass screening searches again in March 2004 to capture any studies published and posted in the interim.
| Key Question | Search Terms | Yield |
|---|---|---|
| All | MEDLINE and CINAHL: (‘Puerperal Disorders’ and (Depression or ‘Depressive Disorder’)) or ‘Depression, Postpartum/ or perinatal depression.mp’ | |
| | PsycINFO: “Depression, Postpartum” | |
| | Sociofile: “Postpartum Depression” | |
| KQ 1 | … and “Natural History” or “Cohort Studies” or “Longitudinal Studies” or | MEDLINE = 165 |
| | … and Incidence or Prevalence | CINAHL = 42 |
| | | PsycINFO = 88 |
| | | Sociofile = 21 |
| | | Total unduplicated = 256 |
| KQ 2 | … and “Mass Screening” | MEDLINE = 67 |
| | | CINAHL = 25 |
| | | PsycINFO = 28 |
| | | Sociofile = 1 |
| | | Total unduplicated = 96 |
| KQ 3 | … and treatment.mp or Therapeutics or “treatment failure” or “treatment outcomes” or “treatment duration” or “treatment errors” or “treatment delay” or “treatment complications” | MEDLINE = 513 |
| | | CINAHL = 90 |
| | | PsycINFO = 91 |
| | | Sociofile = 5 |
| | | Total unduplicated = 485 |
Three senior reviewers with clinical expertise in perinatal depression reviewed the abstracts of articles identified during the literature search. Two clinicians evaluated each abstract against the inclusion criteria and resolved any differences in inclusion by consensus. In several instances, the abstracts did not provide enough information to make an inclusion decision; we pulled full articles to review for those studies. Of the 846 articles identified, 729 did not meet the inclusion criteria for any of the key questions and were therefore excluded, 8 studies were pulled for background only, and the remaining 109 articles were pulled for a full review.
Among the 109 studies pulled for full review, 50 did not meet our inclusion/exclusion criteria for any of the three key questions. The most common reason for exclusion was the absence of a gold standard (i.e., either a clinical assessment or structured clinical interview) for assessing depression, which eliminated 26 studies. Ten of the studies pulled for the evaluation of the properties of screening instruments were excluded because they did not report sensitivity and specificity or data from which these statistics could be computed. Other reasons for exclusion were depression assessed after the first year postpartum, no depression outcome measure, a retrospective study design, and restriction of the study sample to specific population subgroups (e.g., teens, patients of psychiatric hospitals). We based the last exclusion on two lines of reasoning. First, although groups such as adolescents are a key subgroup, our charge was to ensure that our results were generalizable to the broader US population. Second, these specific subpopulations are different enough from the remainder of the population that they warrant separate consideration. We excluded only one study because it was limited to an adolescent population.
We included the remaining 59 studies in our review, and some met the inclusion criteria for more than one key question. We abstracted 30 studies for KQ 1, 23 for KQ 2, and 15 for KQ 3. We provide a graphical presentation of the disposition of the citations in Figure 2.
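The disposition counts reported above can be checked with simple arithmetic. The short Python sketch below only restates the numbers given in the text; the variable names are illustrative.

```python
# Counts as reported in the text above.
identified = 846
excluded_at_abstract = 729
background_only = 8
full_review = 109
excluded_at_full_review = 50
included = 59
abstracted = {"KQ 1": 30, "KQ 2": 23, "KQ 3": 15}

# The abstract-stage categories should account for every identified article.
assert excluded_at_abstract + background_only + full_review == identified
# Full-review exclusions plus inclusions should equal the articles pulled.
assert excluded_at_full_review + included == full_review

# A study can count toward more than one key question, so the per-question
# total may exceed the number of unique included studies.
extra_assignments = sum(abstracted.values()) - included
print(f"Key-question assignments beyond unique studies: {extra_assignments}")
```

The difference between the per-question total and the 59 unique studies reflects studies that were abstracted for more than one key question.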
The data collection process involved abstracting relevant information from the eligible articles and generating evidence tables that present the key details of the study design and the major findings from the articles. A trained member of the study team read and abstracted each article; a second member checked the table entries for accuracy against the original article.
Appendix C contains the final evidence tables in their entirety. They provide the study design details and major findings. The dimensions of each study design abstracted vary by key question, but they contain some common elements, such as author, year of publication, study location (e.g., country, state), population description, and sample size. We also collected information on the clinical interview instrument and diagnostic criteria used to diagnose depression and the age and racial and ethnic distribution of study subjects in each study.
The study results are recorded in the form reported in the article. However, for assessing consistency of results across the studies and for combining study results in a meta-analysis (see below), we also transformed the study results when necessary into consistent outcome measures using the appropriate statistical formulas. These computed data elements are shown in bold in the evidence tables (Appendix C).
We conducted data abstraction electronically in a word processing program and in such a way that study identifiers and results were easily transferred from the forms to electronic files for input into programs for meta-analysis.
We conducted a meta-analysis of the different prevalence and incidence estimates from studies abstracted for KQ 1 to arrive at single prevalence and incidence estimates for particular periods and points in time. We elaborate on these methods in Chapter 3. We also conducted meta-analyses of the different estimates of the receiver operating characteristics (ROC) curves for screening instruments evaluated for KQ 2, as described in Chapter 4. Because of the diversity of screening instruments and prevention interventions in the studies found for KQ 3, we did not conduct a meta-analysis for this key question.
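To illustrate what combining prevalence estimates across studies can look like, the sketch below pools proportions with fixed-effect, inverse-variance weights and a normal-approximation 95% CI. This is an assumption chosen for simplicity, not necessarily the method used in this report (the actual methods are described in Chapter 3), and the study counts are hypothetical.

```python
import math


def pooled_prevalence(studies):
    """Fixed-effect, inverse-variance pooled proportion with a 95% CI.

    studies: list of (cases, sample_size) pairs.
    Returns (pooled, lower, upper) as proportions.
    """
    weights, estimates = [], []
    for cases, n in studies:
        p = cases / n
        variance = p * (1 - p) / n  # binomial variance of a proportion
        if variance == 0:
            raise ValueError("a study with 0% or 100% prevalence has zero variance")
        weights.append(1 / variance)
        estimates.append(p)
    total_weight = sum(weights)
    pooled = sum(w * p for w, p in zip(weights, estimates)) / total_weight
    se = math.sqrt(1 / total_weight)  # standard error of the pooled estimate
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se


# Hypothetical (cases, sample size) pairs, for illustration only.
studies = [(12, 150), (30, 400), (9, 120)]
pooled, lower, upper = pooled_prevalence(studies)
print(f"pooled={pooled:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Larger studies receive more weight because their variance is smaller. A random-effects model would widen the interval when the studies disagree more than chance alone would explain.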
At the same time that we abstracted information on the study designs and findings in the included articles, we rated the quality of the studies. We developed a quality rating form for the screening accuracy (KQ 2) articles from criteria identified by the Cochrane Methods Working Group on Systematic Review of Screening and Diagnostic Tests.25 For studies addressing KQ 1 and KQ 3, we modified the quality rating forms developed by Downs and Black for RCTs and observational studies.26 These forms are provided in Appendix B.
The quality rating forms rated the reporting completeness and clarity, external validity, internal validity, and the power or precision of each study for the relevant key questions. Hence, the ratings refer to the usefulness or quality of the article for our purposes and not necessarily for the original purpose of the research or article. Studies that were included in more than one key question were rated separately for each key question. The specific quality items rated are described in more detail in Chapters 3, 4, and 5 for KQs 1, 2, and 3, respectively.
The senior abstractor completed the quality rating form for each article; another project team member then reviewed the completed form for accuracy and completeness. The overall quality scores of these articles are recorded in the evidence tables (Appendix C); scores on each of the domains are provided in Chapters 3, 4, and 5. All graded studies were included in the analysis regardless of their quality score. However, evidence from studies graded as poor was given less weight in the qualitative and quantitative syntheses and discussion.
In addition to the individual studies, we also rated the strength of the collective evidence on each key question. We applied four separate criteria: (1) number of studies, (2) aggregate sample sizes over the studies, (3) quality of the individual studies, and (4) representativeness of the study populations in the studies.
As is customary for all evidence reports and systematic reviews done for AHRQ, the RTI-UNC EPC requested review of this report from a wide array of outside experts in the field and from relevant professional societies and public organizations. AHRQ has also requested review from its own staff and appropriate federal agencies. We provide a list of the external peer reviewers in Appendix E. This report reflects substantive and editorial comments from this external peer review.
Perinatal depression is generally recognized to be a common affliction among women during pregnancy and the first postpartum year. However, estimates of the prevalence and incidence of the condition vary widely—from 5 percent to more than 25 percent of pregnant women and new mothers—depending on the assessment method, the timing of the assessment, and population characteristics.21, 22, 27 To estimate disease burden more accurately and thereby better target and prioritize health care expenditures, we need more precise estimates of the prevalence and incidence of perinatal depression.
Two prior systematic reviews of the prevalence of perinatal depression—one for the early postpartum months and the other for pregnancy—are notable. O'Hara and Swain conducted the first meta-analysis of the prevalence of postpartum depression and investigated sources of variability in the prevalence estimates across studies.21 The authors combined estimates from 59 studies in which depression had been assessed at least 2 weeks postpartum using either a clinical interview or a validated self-report measure with an established cutoff (i.e., Beck Depression Inventory [BDI] ≥ 10; Edinburgh Postnatal Depression Scale [EPDS] ≥ 13; Zung Depression Scale ≥ 48; Center for Epidemiological Studies—Depression [CES-D] scale ≥ 16). Based on a total sample of 12,810 postpartum women, they estimated the average prevalence of postpartum depression to be 13.0 percent, with a 95% confidence interval (CI) of 12.3 percent to 13.4 percent. They found that self-report measures yielded significantly higher estimates of postpartum depression than interview-based methods and that longer evaluation periods resulted in higher estimates. The number of days postpartum when the depression assessment was made and the country in which the study was conducted did not significantly affect the prevalence estimates in their analysis.
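The self-report cutoffs listed above amount to a simple threshold rule. As a small illustration, the Python below encodes the four cutoffs exactly as stated for the O'Hara and Swain meta-analysis; the dictionary and function names are hypothetical.

```python
# Cutoffs as listed in the text: a score at or above the value screens positive.
CUTOFFS = {
    "BDI": 10,    # Beck Depression Inventory
    "EPDS": 13,   # Edinburgh Postnatal Depression Scale
    "Zung": 48,   # Zung Depression Scale
    "CES-D": 16,  # Center for Epidemiological Studies-Depression scale
}


def screens_positive(instrument, score):
    """Return True if the score meets or exceeds the instrument's cutoff."""
    if instrument not in CUTOFFS:
        raise KeyError(f"no cutoff recorded for {instrument!r}")
    return score >= CUTOFFS[instrument]


print(screens_positive("EPDS", 13))
print(screens_positive("BDI", 9))
```

A positive screen under such a rule is not a diagnosis, which is one reason self-report and interview-based prevalence estimates can differ.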
More recently, Bennett et al. conducted a meta-analysis of prevalence estimates for depression during pregnancy.27 The authors combined estimates from 21 studies meeting predetermined inclusion criteria, including the assessment of depression by a structured clinical interview, the BDI, or the EPDS. Based on a total sample of 19,284 pregnant women, they estimated the prevalence of depression to be 7.4 percent (95% CI, 2.2 percent to 12.6 percent) during the first trimester, 12.8 percent (95% CI, 10.7 percent to 14.8 percent) during the second trimester, and 12.0 percent (95% CI, 7.4 percent to 16.7 percent) during the third trimester. The 95% CIs of these estimates overlap substantially, indicating that, given available evidence, the prevalence of depression during pregnancy cannot be said to differ significantly by trimester. The authors also found that, compared with structured clinical interviews, the self-report BDI produced significantly higher prevalence estimates, whereas the self-report EPDS produced statistically equivalent estimates.
Several factors point to the need for a reassessment of the prevalence of depression during pregnancy and the postpartum period at this time. First, the clinical definition of major depression has changed over time, becoming more precise. Definitions of major depression prior to the 1987 revision of the Diagnostic and Statistical Manual of Mental Disorders, Third Edition (DSM-III-R), were broader than subsequent definitions and likely included some minor depression and dysthymia. Minor depression is a proposed diagnosis for further study for which the 1994 DSM, fourth edition (DSM-IV), has defined research criteria;28 however, it has not yet been added to the DSM-IV. Furthermore, DSM-IV has an even more precise definition of major depression, requiring a minimum number of depressive symptoms and functional impairment, whereas DSM-III-R required only counts of depressive symptoms. Most of the literature reviewed in the O'Hara and Swain study21 (published in 1996) was published before 1994. Determining whether more recent studies affect the combined prevalence estimates and CIs is crucial to improving understanding of this disorder.
Most of the studies in Bennett et al.27 (done in 2004) were published after 1994. Whether the combined prevalence estimates refer to major and minor depression together or major depression alone is not clear. The text of the article discusses major depression, but the tables clearly indicate the inclusion of minor depression.
Second, neither review distinguished between measures of the point prevalence, the percentage of the population with depression at a given point in time (e.g., at 24 weeks gestational age or 9 weeks postpartum), and measures of period prevalence, the percentage of the population with depression over a period of time (e.g., during pregnancy or from delivery to the end of the first 3 months postpartum). Both types of estimates are used for the single combined prevalence estimates, although O'Hara and Swain did test the effect of differing time points and durations for the depression assessment in a meta-regression.
Third, neither of the reviews presented evidence of the incidence of perinatal depression—the percentage of the population with depressive episodes that begin within a given period of time.
Fourth, overall prevalence estimates from both reviews are confounded by false positives because they included prevalence estimates from studies that assessed depression with self-report instruments. As mentioned above, both systematic reviews found that self-report instruments produce significantly higher prevalence estimates than do clinical interviews.
Finally, although both systematic reviews discussed prevalence estimates for women who were not pregnant and had not recently delivered a child, neither study rigorously reviewed the evidence that compares depression rates for women during pregnancy and the first postpartum year to the rates for women of a similar age during nonchildbearing times.
This chapter reviews the literature addressing Key Question (KQ) 1: What is the prevalence and incidence of depression (major and minor) during pregnancy and during the first year postpartum? Is the prevalence or incidence increased during pregnancy and the first postpartum year compared to nonchildbearing periods?
We abstracted study features and all estimates of the prevalence and incidence of major and minor depression together and of major depression alone from the 30 included studies found through our literature searches described in Chapter 2. During the abstraction process, we graded the quality of the study based on selected study features. We then analyzed the estimates using a variety of meta-analytic methods described in this section.
Appendix B presents the quality rating form used for articles considered for KQ 1. The total possible score for these studies was 20 for studies without a comparison group and 25 for studies with a comparison group. For both types of studies, we considered those articles with a score of 16 or greater to be good, those with scores between 10 and 15 to be fair, and those with scores of 9 and below to be poor. The domains and maximum points possible for each domain are as follows:
Reporting (domain score of 9): Eight items covering study aims, measures, patient populations, findings, and statistical presentation; each scored yes or no (1 or 0), except for an item concerning principal confounders that was scored yes, partially, or no (2, 1, or 0, respectively).
External validity (domain score of 3): Three items relating to the representativeness of populations from which people were recruited and of settings and clinicians that treat such patients; each scored yes, no, or unable to determine (1, 0, or 0, respectively).
Internal validity—bias (domain score of 3): Three items relating to issues such as validation of the depression diagnosis through clinical interview, follow-up periods, and appropriate statistical tests; each scored yes, no, or unable to determine (1, 0, or 0, respectively).
Internal validity—confounding (domain score of 2 for studies without a comparison group and 4 for studies with a comparison group): Two items relating to sources of comparison groups, one for the adequacy of adjustments for confounding, and one for the handling of loss to follow-up; each scored yes, no, or unable to determine (1, 0, or 0, respectively).
Precision (domain score of 3 for studies without a comparison group and 6 with a comparison group): One item relating to the number of pregnant or postpartum women assessed for depression, with scores of 3 for more than 1,000 women, 2 for 250 to 1,000 women, 1 for 30 to 250 women, and 0 for fewer than 30 women. For studies with a comparison group, a second item gave points based on the size of the smallest comparison group: a score of 3 for more than 2,000 women, 2 for 1,000 to 2,000 women, 1 for 500 to 1,000 women, and 0 for fewer than 500 women.
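The cutoffs above can be expressed as a small lookup. The following is a minimal sketch, not the form in Appendix B; the function names are hypothetical, and the handling of the shared boundary (assigning exactly 250 women to the 2-point band) is an assumption, because the text lists 250 in both ranges.

```python
def precision_points(n_women: int) -> int:
    """Points for the number of pregnant or postpartum women assessed."""
    if n_women > 1000:
        return 3
    if n_women >= 250:  # assumption: exactly 250 falls in the 2-point band
        return 2
    if n_women >= 30:
        return 1
    return 0


def quality_rating(total_score: int) -> str:
    """Map a total quality score to the good / fair / poor categories."""
    if total_score >= 16:
        return "good"
    if total_score >= 10:
        return "fair"
    return "poor"
```

The same rating thresholds apply whether the maximum possible total is 20 or 25, so the rating function takes only the total score.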
We abstracted all estimates of the prevalence and incidence of major and minor depression together and major depression alone. We distinguished prevalence estimates by whether they were point or period estimates and both prevalence and incidence estimates by the time period covered. Time periods for point prevalence estimates were defined as trimesters during pregnancy and months during the first postpartum year. Estimates taken at different weeks of gestation but within the same trimester of pregnancy were considered as being conducted in the same time period (e.g., estimates taken week 14 through week 27 of gestation were considered the second trimester). Similarly, estimates taken at different weeks postpartum but within the same month postpartum were considered within the same time period (e.g., estimates taken during week 1 through week 4 postpartum would be considered month 1; week 5 through week 9 postpartum, month 2). Where we found two or more estimates within the same trimester of pregnancy or month postpartum, we used meta-analysis to obtain a combined estimate for that trimester or month. We then graphed the resulting estimates to determine how they changed throughout pregnancy and the first postpartum year.
We conducted similar procedures for period prevalence and incidence estimates. The relevant time periods were either single trimesters and months or multiple trimesters and months. Because we found fewer estimates of these types, however, we graphed period prevalence and incidence estimates for only the first 3 months postpartum.
We combined all estimates with the same diagnosis, estimate type, and time period using the meta command in Stata. This procedure uses the inverse-variance weighting method to calculate random effects summary estimates. It also produces (1) Q tests of the homogeneity of the estimates and (2) forest plots of the individual study and combined estimates and their CIs. To satisfy the normality assumptions of these methods, we first transformed the prevalence estimates into log odds estimates.
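The pooling step can be illustrated outside Stata. The sketch below is written in Python and assumes the DerSimonian-Laird estimator for the between-study variance, a common choice for inverse-variance random effects models; the report does not state which estimator was used. The function name and the example counts are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def pool_prevalence(studies):
    """Pool (cases, n) pairs on the log-odds scale (assumed DerSimonian-Laird).

    Each study needs 0 < cases < n. Returns
    (pooled prevalence, CI lower, CI upper, Cochran's Q).
    """
    # Log odds and its approximate variance, 1/cases + 1/noncases, per study.
    y = [logit(c / n) for c, n in studies]
    v = [1.0 / c + 1.0 / (n - c) for c, n in studies]

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))

    # Method-of-moments estimate of the between-study variance tau^2.
    k = len(studies)
    c_term = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c_term) if c_term > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    # Build the 95% CI on the log-odds scale, then back-transform.
    lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
    return inv_logit(pooled), inv_logit(lo), inv_logit(hi), q


# Hypothetical counts (cases, sample size); not taken from the included studies.
example = [(12, 120), (30, 250), (9, 100)]
prev, lower, upper, q_stat = pool_prevalence(example)
```

Working on the log-odds scale keeps the back-transformed confidence limits between 0 and 1, which is why the interval is built before applying the inverse logit.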
We reviewed the forest plots of the studies in each summary estimate to determine whether we could identify the source of any heterogeneity between studies. We then reran the meta-analyses excluding studies that were obvious outliers and for which we could identify the source of the bias. The new summary estimates are considered our best estimates of the prevalence and incidence of perinatal depression for the general female population in the United States.
To analyze associations between the prevalence of depression and study characteristics, we conducted cumulative meta-analysis and a series of meta-regressions on the point prevalence estimates for major and minor depression together and major depression alone. In the cumulative meta-analysis, we added studies one by one, based on publication year, to produce a new combined estimate with the cumulative evidence for each year. This procedure allowed us to see trends in the estimate over time. We conducted cumulative meta-analysis on the 2-month point prevalence estimates using the metacum command in Stata.
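The cumulative procedure can be sketched as a running pooled estimate that is updated after each study is added in publication order. This Python example is illustrative only: to stay short it uses a fixed-effect inverse-variance average of log odds, which is an assumption and may differ from the model used with metacum, and the study tuples are hypothetical.

```python
import math


def cumulative_pooled(studies):
    """Return [(year, pooled prevalence)] after adding each study by year.

    `studies` holds (year, cases, n) tuples with 0 < cases < n. Pooling is a
    fixed-effect inverse-variance average of log odds (a simplifying assumption).
    """
    sum_w = 0.0
    sum_wy = 0.0
    results = []
    for year, cases, n in sorted(studies):  # tuples sort by year first
        log_odds = math.log(cases / (n - cases))
        weight = 1.0 / (1.0 / cases + 1.0 / (n - cases))  # inverse variance
        sum_w += weight
        sum_wy += weight * log_odds
        pooled = sum_wy / sum_w
        results.append((year, 1.0 / (1.0 + math.exp(-pooled))))
    return results


# Hypothetical studies: (publication year, cases, sample size).
trend = cumulative_pooled([(1996, 15, 100), (1984, 10, 100), (1990, 24, 200)])
```

Reading the returned list from first to last entry shows how the combined estimate shifts as later publications are added.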
We then used the Stata metareg command to estimate several different meta-regression models. For all models, we used the log odds as the dependent variable. As explanatory variables, we included the time point at which depression was assessed, an indicator for whether the study enrolled only low-risk women, and an indicator for whether it enrolled only women of low socioeconomic status (SES). The time point was represented by a categorical variable with categories for the first, second, and third trimesters and the first, second, and third months postpartum. The reference category for this variable was 4 to 12 months postpartum.
We estimated seven different models. Each had a different set of additional explanatory variables:
No additional explanatory variables;
Publication year;
Study country, categorized as the United States (the reference category), other western countries, and Asian countries;
Interview type, categorized as the Schedule for Affective Disorders and Schizophrenia (SADS) (the reference category), the Structured Clinical Interview for DSM Diagnoses (SCID), and other interview types;
Diagnostic criteria, categorized as Research Diagnostic Criteria (RDC) (the reference category), DSM-III-R, DSM-IV, and other criteria;
Whether depression was assessed only for women who were designated as at risk based on a screening instrument; and
The quality rating score.
To answer the second part of KQ 1, whether the prevalence and incidence of depression is higher during pregnancy and the first year postpartum compared to nonchildbearing periods, we computed odds ratios for studies with a comparison group of women of similar age during nonchildbearing times. Because the types and timing of prevalence and incidence estimates did not overlap in these studies, except for one time point, we did not conduct meta-analyses of the log odds ratios.
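As an illustration of the odds ratio computation, the following Python sketch derives an odds ratio and an approximate 95% CI from a two-by-two table, using the standard error of the log odds ratio. The interval method is an assumption, because the report does not state how CIs were obtained, and the counts are hypothetical.

```python
import math


def odds_ratio(cases_dep, cases_n, controls_dep, controls_n):
    """Return (OR, lower, upper) with an approximate 95% CI.

    All four cells of the two-by-two table must be nonzero.
    """
    a, b = cases_dep, cases_n - cases_dep            # childbearing group
    c, d = controls_dep, controls_n - controls_dep   # comparison group
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)    # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, lo, hi


# Hypothetical counts, not taken from the included studies.
or_, lo, hi = odds_ratio(cases_dep=30, cases_n=200, controls_dep=20, controls_n=200)
```

An interval that includes 1 would indicate that the data do not distinguish depression rates in the childbearing group from those in the comparison group.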
We found 28 prospective studies and two retrospective studies that met our inclusion criteria. Only three of the prospective studies included a comparison group of nonpregnant women of similar age.a In this section, we first describe the study characteristics and then present our analysis of the study results.
| Author, Year | Country | Sample Size | Who Interviewed | When Interviewed | Interview Type | Diagnostic Criteria |
|---|---|---|---|---|---|---|
| Prospective Cohort Studies without Comparison Groups | ||||||
| Affonso et al., 199029 | US | 202 | All | Pregnancy & PP | SADS-PPG | RDC |
| Areias et al., 199630 | Portugal | 54 | All | Pregnancy & PP | SADS | RDC |
| Berle et al., 200331 | Norway | 411 | All EPDS ≥ 8 & some < 8 | PP | MINI-V4.4/ MADRS | DSM-IV |
| Campbell and Cohn, 199132 | US | 1,033 | All | PP | SADS | RDC |
| Cooper et al., 199633 | England | 4,964 | EPDS ≥ 8 | PP | SCID | DSM-III-R |
| Cox et al., 198234 | Scotland | 105 | All | PP | SPI | Pitt's |
| Garcia-Esteve et al., 200335 | Spain | 1,123 | All EPDS ≥ 9 & some < 9 | PP | SCID-NP | DSM-IV |
| Gotlib et al., 198936 | Canada | 295 | All BDI ≥ 10 & some < 10 | Pregnancy & PP | SADS | RDC |
| Hobfoll et al., 199537 | US | 192 | All | Pregnancy & PP | SADS | RDC |
| Kent et al., 199938 | Australia | 710 | GHQ28 > 4 | PP | CIDI-A | DSM-III-R |
| Kitamura et al., 199339 | Japan | 120 | All | Pregnancy | SADS/ SADS-C | RDC |
| Kitamura et al., 199940 | Japan | 111 | All | Pregnancy & PP | SADS | RDC |
| Kumar and Robson, 198441 | England | 196 | All | Pregnancy & PP | SPI | RDC |
| Lee et al., 200142 | Hong Kong | 781 | All GHQ > 4 & some ≤ 4 | PP | Modified SCID | Modified DSM-III-R |
| Lee et al., 200143 | Hong Kong | 145 | All | PP | Modified SCID | Modified DSM-III-R |
| Lucas et al., 200144 | Spain | 641 | BDI > 21 | PP | Not specified | DSM-III-R |
| Matthey et al., 200345 | Australia | 408 | All | PP | DIS | DSM-IV |
| Murray and Cox, 199046 | England | 100 | All | Pregnancy | SPI | RDC |
| O'Hara et al., 198419 | US | 99 | All | Pregnancy & PP | SADS | RDC |
| Pop et al., 199347 | Netherlands | 293 | All | Pregnancy & PP | Not specified | RDC |
| Watson et al., 198448 | England | 128 | All | Pregnancy & PP | SPI | ICD-9 |
| Whiffen, 198849 | Canada | 115 | All | PP | SADS | RDC |
| Yamashita et al., 200050 | Japan | 88 | All | PP | SADS | RDC |
| Yonkers et al., 200123 | US | 802 | All IDS ≥ 18 or EPDS ≥ 12 & some < 12 | PP | SCID | DSM-IV |
| Yoshida et al., 199751 | England | 98 | All | PP | SADS | RDC |
| Prospective Studies with Comparison Groups | ||||||
| Cooper et al., 198852 | England | 483 cases; 313 controls | All GHQ ≥ 12 & some < 12 | PP | PSE/ MADRS | PSE ID/ Catego Class |
| Cox et al., 199320 | England | 232 cases; 232 controls | All EPDS ≥ 9 & some < 9 | PP | SPI | RDC |
| O'Hara et al., 199053 | US | 182 cases; 179 controls | All | Pregnancy & PP | SADS | RDC |
| Retrospective Studies | ||||||
| Bryan et al., 199954 | US | 403 | — | PP | Medical records | Diagnosis of 2 or more symptoms |
| Georgiopoulos et al., 200155 | US | 342 | — | PP | Medical records | Diagnosis |
BDI, Beck Depression Inventory; CIDI-A, Composite International Diagnostic Interview; DIS, Diagnostic Interview Schedule; DSM-III-R, Diagnostic and Statistical Manual of Mental Disorders, Third Edition, Revised; DSM-IV, Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition; EPDS, Edinburgh Postnatal Depression Scale; GHQ, General Health Questionnaire; ICD-9, International Classification of Diseases, Ninth Edition; MADRS, Montgomery-Asberg Depression Rating Scale; MINI-V4.4, Mini International Neuropsychiatric Interview, Version 4.4; PP, postpartum; PSE, Present State Examination; PSE ID, PSE Index of Definition; RDC, Research Diagnostic Criteria; SADS, Schedule for Affective Disorders and Schizophrenia; SADS-C, SADS Change Version; SADS-PPG, SADS-Pregnancy and Postpartum Guidelines; SCID, Structured Clinical Interview for DSM-III-R; SCID-NP, Structured Clinical Interview for DSM-III-R, Non-Patient Version; SPI, Standardized Psychiatric Interview.
Precision. The study sample sizes ranged from 54 to 4,964 women; the median sample size was 202 women. Although all the studies had an adequate sample size to provide a prevalence estimate of 10 percent with 80 percent power at a 95% confidence level, most were not large enough to allow subgroup analyses.
The three studies with comparison groups included 313, 232, and 179 women in the comparison groups.20, 52, 53 These sample sizes are inadequate to detect a difference as large as 5 percentage points in incidence or prevalence at 80 percent power and a 95% confidence level; a minimum sample size of more than 500 per group is required.
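The sample size requirement can be checked with the usual normal-approximation formula for comparing two proportions. The Python sketch below is illustrative: the 10 percent baseline rate and the two-sided test are assumptions chosen for the example, not figures stated for this calculation in the report.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect p1 versus p2 (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Assumed baseline of 10 percent versus 15 percent (a 5-point difference).
required = n_per_group(0.10, 0.15)
```

Under these assumptions the required size per group is several hundred women, consistent with the statement that comparison groups of 179 to 313 women are too small.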
None of the studies was designed to compare rates of depression among women of different racial and ethnic groups. Sixteen of the 30 studies did not even specify the racial and ethnic composition of the study subjects. Among the other 14 studies, 5 included only white non-Hispanic women;19, 32, 38, 44, 47 two studies included only Chinese women;42, 43 and two others included only Japanese women.50, 51 The remaining five studies noted a racially mixed population, but all had a predominant race or ethnicity. In four of these studies, 73 percent to 90 percent of the women were white non-Hispanic,29, 36, 37, 48 and, in the fifth, 75 percent were Hispanic.23
Depression Assessment. Our inclusion criteria required that the study use a clinical interview or assessment to validate depression diagnoses. The prospective studies differed in who received a clinical interview, the interview instrument, the diagnostic criteria used to identify a depressive episode from the interview responses, and when the interview was conducted. These differences can affect the resulting estimates of prevalence and incidence.
Eighteen of the 28 prospective studies conducted a clinical interview on all study women. The remaining 10 studies first had study subjects complete a self-report depression screening instrument, such as the EPDS, the BDI, or the General Health Questionnaire (GHQ), a broader measure designed to assess the presence of psychiatric distress related to general medical illness. These studies then administered a clinical interview to women scoring over a predetermined cutoff on the screening instrument. Seven of the 10 studies also interviewed a small sample (e.g., 10 percent) of the women scoring below the cutoff, but few of the studies used the results from these interviews to adjust the final prevalence estimates for false negatives. Most studies used low enough cutoff scores that the resulting downward bias in the estimates was minimal. The one exception was the Lucas et al. study, which used a high cutoff of 21 on the BDI and did not interview any women scoring below the cutoff or adjust the resulting prevalence rates in any way, thereby introducing a significant, uncorrected downward bias.44
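The adjustment for false negatives mentioned above can be sketched as follows. This Python example assumes a simple two-phase design in which every woman above the cutoff is interviewed and a random subsample of those below it is interviewed; the function name and all counts are hypothetical and do not come from any included study.

```python
def adjusted_prevalence(n_pos, dep_pos, n_neg, sampled_neg, dep_sampled_neg):
    """Estimate prevalence when only a subsample of screen-negatives is interviewed.

    n_pos           women scoring above the screening cutoff (all interviewed)
    dep_pos         of those, women diagnosed with depression at interview
    n_neg           women scoring below the cutoff
    sampled_neg     screen-negative women who were interviewed
    dep_sampled_neg of those, women diagnosed with depression at interview
    """
    # Rate of missed cases among screen-negatives, estimated from the subsample.
    false_negative_rate = dep_sampled_neg / sampled_neg
    estimated_missed = false_negative_rate * n_neg
    return (dep_pos + estimated_missed) / (n_pos + n_neg)


# Hypothetical cohort of 1,000 women; numbers are illustrative only.
naive = 60 / 1000  # counts only cases found among screen-positive women
adjusted = adjusted_prevalence(n_pos=200, dep_pos=60, n_neg=800,
                               sampled_neg=80, dep_sampled_neg=2)
```

In this hypothetical cohort the unadjusted estimate omits the cases projected among screen-negative women, which is the downward bias described for studies that did not make the correction.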
Different interview instruments have been developed for identifying depression diagnoses. These different instruments use different criteria for diagnosing depression. Little is known about how these different instruments and diagnostic criteria affect the prevalence and incidence estimates.
The most frequently used instrument among our studies was the SADS. This semistructured interview is widely used in clinical research and has well-established reliability and validity.56 O'Hara et al. adapted the SADS for use with pregnant and postpartum women.19 Twelve of the 28 prospective studies used this interview instrument.
Five of the studies used the section of the SCID that covers depressive disorders.57, 58 The SCID allows the interviewer to use additional questions to inquire about idioms of distress that are specific to the local context. Lee et al. used this feature of the SCID to incorporate questions about traditional Chinese customs used during the puerperium that may affect the clinical presentation of postpartum depression.42, 43 They also modified the instrument to identify cases of minor depression.
Five other studies used the Standardized Psychiatric Interview (SPI) of Goldberg et al.59 The SPI includes 10 five-point scales that rate the severity of neurotic symptoms in the 7 days preceding the interview and a rating of 12 abnormalities observed during the interview.
Other interview instruments used include the Composite International Diagnostic Interview (CIDI-A),60 the Diagnostic Interview Schedule,61 the Mini International Neuropsychiatric Interview (MINI-V4.4),62 the Present State Examination (PSE),63 and the Montgomery and Asberg Depression Rating Scale (MADRS).64
All studies that used the SADS and three of the studies that used the SPI based depression diagnosis on the RDC.65 To be diagnosed with depression, women had to have reported that they felt sad, tearful, or blue for at least 2 weeks. The 2-week criterion serves to rule out women who were experiencing postpartum blues only. In addition, for a diagnosis of major depression, the women had to have reported at least three or four additional symptoms, such as sleeping disturbances, loss of appetite, fatigue, loss of interest in usual activities or the ability to concentrate, psychomotor retardation, and suicidal thoughts. Women with only two to four of these symptoms were classified as having minor depression. The RDC attempts to differentiate between normal physical effects of pregnancy and the puerperium and actual symptoms of depression.
Five of the prospective studies based diagnoses of depression on DSM-III-R criteria and four based diagnoses on DSM-IV criteria. A diagnosis of major depression based on the DSM-III-R criteria is comparable with the RDC for definite major depression.66 However, the RDC includes criteria for minor depression, which, as mentioned above, received its first DSM mention in the fourth edition (DSM-IV)28 as a proposed category for further study. Other criteria used for diagnoses of depression included Pitt's criteria;67 the International Classification of Diseases, Ninth Edition (ICD-9); and PSE Index of Definition (PSE ID) and Catego Class.63
Point prevalence estimates
46 for major depression alone (Figure 4)
Period prevalence estimates
17 for major and minor depression (Figure 5)
12 for major depression alone (Figure 6)
Incidence estimates
21 for major and minor depression (Figure 7)
The numbers in parentheses in these figures are the number of estimates found in the 28 studies for that point or period of time.
For the two retrospective studies, the investigators had abstracted information on symptoms and diagnoses of depression from medical records beginning at delivery and extending to 1 year postpartum. Both studies provided only estimates of 1-year period prevalence. Bryan et al.54 provided estimates of the prevalence for both major and minor depression and major depression alone, whereas Georgiopoulos et al.55 provided only the prevalence of major depression alone. Bryan et al. identified a woman as having postpartum depression if any of the following criteria were found in her medical records:54 (1) two notations at least 2 weeks apart of symptoms of depression; (2) a documented diagnosis of depression by a physician, psychologist, nurse practitioner, or midwife; (3) a new prescription for an antidepressant with no evidence that it was for chronic pain or for any indication other than depression; or (4) documentation of symptoms sufficient to meet the DSM-IV criteria of major depression. Georgiopoulos et al.55 based their prevalence estimate solely on a documented diagnosis of postpartum depression.
| Author, Year | Reporting (9) | External Validity (3) | Internal Validity-Bias (3) | Internal Validity-Confounding (2) | Precision (3) | Total Score (20) |
|---|---|---|---|---|---|---|
| Prospective Cohort Studies without Comparison Groups | ||||||
| Affonso et al., 199029 | 4 | 0 | 3 | 0 | 1 | 8 |
| Areias et al., 199630 | 8 | 0 | 2 | 1 | 1 | 12 |
| Berle et al., 200331 | 5 | 0 | 2 | 0 | 2 | 9 |
| Campbell and Cohn, 199132 | 6 | 0 | 3 | 0 | 3 | 12 |
| Cooper et al., 199633 | 7 | 0 | 2 | 0 | 3 | 12 |
| Cox et al., 198234 | 5 | 1 | 3 | 1 | 1 | 11 |
| Garcia-Esteve et al., 200335 | 7 | 0 | 2 | 1 | 3 | 13 |
| Gotlib et al., 198936 | 5 | 2 | 1 | 1 | 2 | 11 |
| Hobfoll et al., 199537 | 6 | 3 | 2 | 0 | 1 | 12 |
| Kent et al., 199938 | 7 | 1 | 2 | 0 | 2 | 12 |
| Kitamura et al., 199339 | 8 | 0 | 3 | 1 | 1 | 13 |
| Kitamura et al., 199940 | 4 | 1 | 3 | 1 | 1 | 10 |
| Kumar and Robson, 198441 | 7 | 0 | 3 | 0 | 1 | 11 |
| Lee et al., 200142 | 6 | 2 | 2 | 1 | 1 | 12 |
| Lee et al., 200143 | 5 | 2 | 0 | 0 | 1 | 8 |
| Lucas et al., 200144 | 5 | 0 | 2 | 0 | 2 | 9 |
| Matthey et al., 200345 | 6 | 0 | 3 | 0 | 2 | 11 |
| Murray and Cox, 199046 | 6 | 0 | 3 | 0 | 1 | 10 |
| O'Hara et al., 198419 | 6 | 0 | 3 | 0 | 1 | 10 |
| Pop et al., 199347 | 7 | 1 | 3 | 0 | 2 | 13 |
| Watson et al., 198448 | 7 | 2 | 3 | 0 | 1 | 13 |
| Whiffen, 198849 | 6 | 0 | 3 | 0 | 1 | 10 |
| Yamashita et al., 200050 | 6 | 0 | 3 | 0 | 1 | 10 |
| Yonkers et al., 200123 | 8 | 0 | 2 | 2 | 2 | 14 |
| Yoshida et al., 199751 | 7 | 0 | 3 | 0 | 1 | 11 |
| Average | 6.0 | 0.6 | 2.4 | 0.4 | 1.5 | 11.1 |
| Retrospective Studies | ||||||
| Bryan et al., 199954 | 8 | 3 | 2 | 1 | 2 | 16 |
| Georgiopoulos et al., 200155 | 2 | 2 | 1 | 1 | 2 | 8 |
| Average | 5.0 | 2.5 | 1.5 | 1.0 | 2.0 | 12.0 |
| Prospective Studies with Comparison Groups | ||||||
| Cooper et al., 198852 | 6 | 0 | 2 | 0 | 2 | 10 |
| Cox et al., 199320 | 5 | 1 | 2 | 3 | 1 | 12 |
| O'Hara et al., 199053 | 7 | 0 | 3 | 2 | 1 | 13 |
| Average | 6.0 | 0.3 | 2.3 | 1.7 | 1.3 | 11.7 |
Note: Numbers in parentheses are total possible points.
In general, studies ranked good on reporting. The 28 prospective studies, both those with and those without comparison groups, scored an average of 6.0 out of 9 possible points for reporting. The retrospective studies scored 5.0 on average. Most studies clearly described the purpose of the study, the method of assessing depression, the characteristics of the patients in the study, and the study findings. Most studies also provided adequate information to estimate the random variability in the estimates and reported actual probability values for the statistical significance of the main outcomes. Fewer studies provided the distribution of the major principal confounders and described the characteristics of patients lost to follow-up. In particular, studies often did not discuss whether the women had prior depressive episodes or obstetrical complications and frequently did not report the women's socioeconomic status or race and ethnicity. Most studies also did not specifically exclude cases of bipolar disorder or psychosis.
Virtually all prospective studies rated poor on external validity. Prospective studies without a comparison group averaged 0.6 points out of 3 possible points; those with a comparison group averaged 0.3 points. These studies seldom supplied adequate information to determine whether study subjects were representative of the patient population of the facilities from which they were recruited and whether the recruitment facilities were representative of the facilities frequented by the general population in the geographic area. In contrast, the two retrospective studies, which were conducted using the Olmsted County Health Department and Mayo Clinic databases, included the majority of all newly delivered women in the county and therefore scored an average of 2.5 points on external validity.
We separated scores for internal validity into two sets of study design characteristics: those that may bias the prevalence estimates and those that reflect possible confounding factors, which relate to the comparability of the comparison groups and whether losses of patients to follow-up were taken into consideration. The prospective studies scored high on the first measure of internal validity; the studies without a comparison group averaged 2.4 of 3 points and the studies with a comparison group averaged 2.3 points. Virtually all prospective studies assessed the mood of study women within 2 weeks of designated times during pregnancy and postpartum and applied appropriate statistical tests for measuring incidence or prevalence. However, as noted above, 10 studies introduced potential bias by not administering the clinical interview to all study women.
The retrospective studies averaged a lower 1.5 points. Diagnoses were not validated through clinical interview for all women, and Georgiopoulos et al. did not provide adequate information to determine whether they used appropriate statistical techniques to compute the prevalence estimate.55
Studies with comparison groups could get 4 possible points for the internal validity confounding score. We awarded 2 additional points if the cases and controls were recruited from the same population and over the same period of time. Only two of the three prospective studies with comparison groups met these criteria. The comparison group in the Cooper et al. study comprised women interviewed by another researcher over a different time period in a different city. Study women were recruited from the appointments diary of the prenatal clinic and the delivery booking diary of the general practitioner unit of the John Radcliffe Hospital in Oxford; the comparison group was derived from a community sample of Edinburgh women of similar age but who were not pregnant and had not delivered in the previous 12 months.52
By contrast, in the Cox et al. study, both cases and controls resided in the North Staffordshire Health District.20 Cases were recruited from the prenatal clinic lists of the North Staffordshire Maternity Hospital; controls matching cases on marital status, number of children, and age (within 5 years) were recruited from four general practice registers. The O'Hara et al. study recruited cases from a public obstetrics and gynecology clinic and two private practices at the University of Iowa Hospitals and Clinics.53 Each subject was asked to provide the names of five acquaintances similar in age, marital status, work status, and number of children. The acquaintance most similar to the subject was selected as a control.
We also gave points for the internal validity confounding measure if the investigators made adjustments or discussed the possible direction and magnitude of any biases from confounding factors and if they took the loss of patients to follow-up into account in their prevalence or incidence estimate. A minority of studies met either of these criteria, resulting in an average score on this measure of 0.4 out of 2 possible points for prospective studies without comparison groups, 1.7 out of 4 possible points for prospective studies with comparison groups, and 1.0 out of 2 possible points for the retrospective studies.
Finally, we gave 17 studies with 30 to 250 pregnant or recently delivered women a precision score of 1, 10 studies with 250 to 1,000 women a precision score of 2, and 3 studies with more than 1,000 women a precision score of 3. None of the studies had a comparison group of at least 500 women; therefore, we awarded no additional points for precision. The average precision score was 1.5 for prospective studies without comparison groups, 1.3 for prospective studies with comparison groups, and 2.0 for the retrospective studies.
In summary, the included studies generally were rated as good on reporting and internal validity for bias, poor on external validity and internal validity for confounding, and only fair on precision.
| Start Date | End Date | Studies | Estimate | 95% Confidence Interval | P-Value for Test of Homogeneity |
|---|---|---|---|---|---|
| Point Prevalence | |||||
| 1st trimester | | 29,40,41 | 6.4% | 2.3%–16.2% | 0.002 |
| 2nd trimester | | 19,36,37,41,53 | 11.0% | 5.7%–20.4% | <0.001 |
| 3rd trimester | | 29,36,37,40,41,46,47 | 8.7% | 4.9%–15.0% | <0.001 |
| 1 week PP | | 40 | 5.5% | 1.8%–12.4% | |
| 1 month PP | | 23,29,36,40,42,47,50 | 8.8% | 6.4%–11.9% | 0.002 |
| 2 months PP | | 31,32,35,37,43,49,53 | 11.3% | 7.7%–16.2% | <0.001 |
| 3 months PP | | 41,42,47,50 | 12.9% | 10.6%–15.8% | 0.707 |
| 4 months PP | | 29,47 | 4.3% | 0.6%–25.4% | 0.001 |
| 5 months PP | | 47 | 10.6% | 7.3%–14.7% | |
| 6 months PP | | 20 | 9.9% | 6.4%–14.5% | |
| 7 months PP | | 41,47 | 10.6% | 7.1%–15.6% | 0.180 |
| 8 months PP | | 47 | 6.5% | 4.0%–9.9% | |
| 12 months PP | | 41 | 6.5% | 2.7%–12.9% | |
| Period Prevalence | |||||
| Conception | 2nd trimester | 30 | 9.3% | 3.1%–20.3% | |
| Conception | Birth | 30,39,41 | 18.4% | 14.3%–23.3% | 0.931 |
| 2nd trimester | 3rd trimester | 36 | 10.2% | 7.0%–14.2% | |
| Birth | 1 month PP | 50 | 13.6% | 7.3%–22.6% | |
| Birth | 2 months PP | 19,32,45 | 8.9% | 6.8%–11.7% | 0.135 |
| Birth | 3 months PP | 30,50,51 | 19.2% | 10.7%–31.9% | 0.016 |
| Birth | 5 months PP | 34 | 29.1% | 20.6%–38.9% | |
| Birth | 6 months PP | 20 | 13.8% | 9.6%–18.9% | |
| Birth | 8 months PP | 47 | 20.8% | 16.3%–25.9% | |
| Birth | 12 months PP | 30 | 53.7% | 39.6%–67.4% | |
| Incidence | |||||
| Conception | 1st trimester | 39,41 | 11.3% | 7.8%–16.3% | 0.757 |
| Conception | 2nd trimester | 30 | 5.8% | 1.2%–16.0% | |
| Conception | Birth | 30,39 | 14.5% | 8.1%–24.4% | 0.192 |
| 1st trimester | 2nd trimester | 41 | 2.7% | 0.6%–7.6% | |
| 2nd trimester | 3rd trimester | 36,41 | 2.2% | 1.1%–4.1% | 0.627 |
| 2nd trimester | 2 months PP | 37 | 12.5% | 7.9%–18.5% | |
| Birth | 1 month PP | 36,42,50 | 7.8% | 3.6%–16.1% | 0.003 |
| Birth | 2 months PP | 19 | 10.3% | 5.1%–18.1% | |
| Birth | 3 months PP | 30,41,42,50,51 | 14.5% | 10.9%–19.2% | 0.142 |
| Birth | 6 months PP | 20 | 11.1% | 7.3%–16.0% | |
| Birth | 12 months PP | 30 | 49.0% | 34.4%–63.7% | |
| Start Date | End Date | Studies | Estimate | 95% Confidence Interval | P-Value for Test of Homogeneity |
|---|---|---|---|---|---|
| Point Prevalence | |||||
| 1st trimester | | 29,40,41 | 2.4% | 0.7%–8.2% | 0.032 |
| 2nd trimester | | 19,37,48,53 | 6.4% | 3.7%–11.0% | 0.029 |
| 3rd trimester | | 29,37,40,46,47 | 3.4% | 1.8%–6.4% | 0.116 |
| 1 week PP | | 40 | 0.0% | 0.0%–3.2% | |
| 1 month PP | | 23,42,44,50 | 2.8% | 1.5%–5.5% | 0.000 |
| 2 months PP | | 31,33,35,37,42,48,49 | 6.8% | 3.8%–11.9% | 0.000 |
| 3 months PP | | 42,44,50 | 3.8% | 2.4%–6.1% | 0.010 |
| 4 months PP | | 29,47 | 2.3% | 1.1%–4.9% | 0.435 |
| 5 months PP | | 47 | 2.1% | 0.8%–4.4% | |
| 6 months PP | | 20,38,44,52 | 4.2% | 2.1%–8.7% | 0.000 |
| 7 months PP | | 47 | 3.1% | 1.4%–5.8% | |
| 8 months PP | | 47 | 1.0% | 0.2%–3.0% | |
| 9 months PP | | 44 | 0.0% | 0.0%–0.7% | |
| 12 months PP | | 44,52 | 1.3% | 0.0%–56.6% | 0.206 |
| Period Prevalence | |||||
| Conception | Birth | 39 | 12.7% | 7.1%–20.4% | |
| 1st trimester | Birth | 48 | 9.4% | 4.9%–15.8% | |
| Birth | 1 month PP | 50 | 5.7% | 1.9%–12.8% | |
| Birth | 2 months PP | 19,32, | 6.5% | 5.2%–8.2% | 0.516 |
| Birth | 3 months PP | 50,51 | 7.1% | 4.1%–11.7% | 0.626 |
| Birth | 5 months PP | 34 | 12.6% | 6.9%–20.6% | |
| Birth | 6 months PP | 20 | 6.5% | 3.7%–10.4% | |
| Birth | 8 months PP | 47 | 6.8% | 4.2%–10.4% | |
| Birth | 12 months PP | 44,48 | 6.6% | 0.5%–51.7% | 0.000 |
| Incidence | |||||
| Conception | Birth | 30,39,48 | 7.5% | 3.8%–14.2% | 0.116 |
| 2nd trimester | 2 months PP | 37 | 3.0% | 1.0%–6.8% | |
| Birth | 1 month PP | 23,42,50 | 3.9% | 2.9%–5.4% | 0.429 |
| Birth | 2 months PP | 48 | 8.1% | 4.0%–14.4% | |
| Birth | 3 months PP | 42,50,51 | 6.5% | 4.2%–9.6% | 0.767 |
| Birth | 12 months PP | 30 | 30.6% | 18.3%–45.4% | |
The results of these tests indicate that considerable heterogeneity exists across the studies included in many of the pooled estimates, particularly among the point prevalence estimates. Therefore, we first discuss the results of our analysis of outliers and then discuss the results of the revised meta-analyses. We finish this section by presenting the findings from the studies with comparison groups of nonchildbearing women.
Outliers. In a review of the forest plots of the meta-analyses of the prevalence and incidence estimates, we found estimates from several studies consistently to be outliers for all time periods at which they assessed the women's mood. Two studies included only women at low risk of depression.29, 32 Affonso et al.29 included only primigravida women with a viable fetus who were married or living with the infant's father and who had no recent depression episodes. Campbell and Cohn32 included only primiparous women who delivered full-term, single infants without major complications and who were Caucasian, married, over 17 years of age, and had at least a high school education. The estimates from these studies were consistently lower than the estimates from the other studies.
Two additional studies included only women of lower socioeconomic status.23, 37 These studies generally provided higher estimates of depression prevalence and incidence than the other studies.
The Lucas et al. study included only women who screened positive for depression on the BDI.44 The cutoff used (> 21) was so high that the bias from false negatives produced consistently lower prevalence estimates compared to the other studies.
Finally, because of its size, the Cooper et al. study dominated the combined 2-month point prevalence estimate for major depression alone.33 However, the 15.3 percent estimated point prevalence from this study is outside the 95% CI of the combined estimate for major and minor depression. The purpose of the study was not to produce a prevalence estimate but rather to develop a predictive index for postpartum depression. Furthermore, many of the clinical interviews were conducted by telephone and the article did not state whether a clinician or lay person conducted the interview. Thus, the procedures for assessing depression in this study may have introduced significant bias in the prevalence estimate.
| Start Date | End Date | Studies | Estimate | 95% Confidence Interval | P-Value for Test of Homogeneity |
|---|---|---|---|---|---|
| Point Prevalence | |||||
| 1st trimester | | 40,41 | 11.0% | 7.6%–15.8% | 0.383 |
| 2nd trimester | | 19,36,41,53 | 8.5% | 6.6%–10.9% | 0.921 |
| 3rd trimester | | 36,40,41,46,47 | 8.5% | 6.5%–11.0% | 0.235 |
| 1 week PP | | 40 | 5.5% | 1.8%–12.4% | |
| 1 month PP | | 23,36,40,42,47,50 | 9.7% | 7.7%–12.3% | 0.060 |
| 2 months PP | | 31,35,43,49,53 | 10.6% | 8.7%–13.0% | 0.121 |
| 3 months PP | | 41,42,47,50 | 12.9% | 10.6%–15.8% | 0.707 |
| 4 months PP | | 47 | 10.6% | 7.3%–14.7% | |
| 5 months PP | | 47 | 10.6% | 7.3%–14.7% | |
| 6 months PP | | 20 | 9.9% | 6.4%–14.5% | |
| 7 months PP | | 41,47 | 10.6% | 7.1%–15.6% | 0.180 |
| 8 months PP | | 47 | 6.5% | 4.0%–9.9% | |
| 12 months PP | | 41 | 6.5% | 2.7%–12.9% | |
| Period Prevalence | |||||
| Conception | 2nd trimester | 30 | 9.3% | 3.1%–20.3% | |
| Conception | Birth | 30,39,41 | 18.4% | 14.3%–23.3% | 0.931 |
| 2nd trimester | 3rd trimester | 36 | 10.2% | 7.0%–14.2% | |
| Birth | 1 month PP | 50 | 13.6% | 7.3%–22.6% | |
| Birth | 2 months PP | 19,45 | 9.6% | 8.0%–11.4% | 0.362 |
| Birth | 3 months PP | 30,50,51 | 19.2% | 10.7%–31.9% | 0.016 |
| Birth | 5 months PP | 34 | 29.1% | 20.6%–38.9% | |
| Birth | 6 months PP | 20 | 13.8% | 9.6%–18.9% | |
| Birth | 8 months PP | 47 | 20.8% | 16.3%–25.9% | |
| Birth | 12 months PP | 30 | 53.7% | 39.6%–67.4% | |
| Incidence | |||||
| Conception | 1st trimester | 39,41 | 11.3% | 7.8%–16.3% | 0.757 |
| Conception | 2nd trimester | 30 | 5.8% | 1.2%–16.0% | |
| Conception | Birth | 30,39 | 14.5% | 8.1%–24.4% | 0.192 |
| 1st trimester | 2nd trimester | 41 | 2.7% | 0.6%–7.6% | |
| 2nd trimester | 3rd trimester | 36,41 | 2.2% | 1.1%–4.1% | 0.627 |
| Birth | 1 month PP | 36,42,50 | 7.8% | 3.6%–16.1% | 0.003 |
| Birth | 2 months PP | 19 | 10.3% | 5.1%–18.1% | |
| Birth | 3 months PP | 30,41,42,50,51 | 14.5% | 10.9%–19.2% | 0.142 |
| Birth | 6 months PP | 20 | 11.1% | 7.3%–16.0% | |
| Birth | 12 months PP | 30 | 49.0% | 34.4%–63.7% | |
NOTE: Best estimates reflect the single or combined estimate at each point or period of time remaining after estimates with obvious, identifiable biases have been dropped.
PP, postpartum.
| Start Date | End Date | Studies | Estimate | 95% Confidence Interval | P-Value for Test of Homogeneity |
|---|---|---|---|---|---|
| Point Prevalence | |||||
| 1st trimester | | 40,41 | 3.8% | 1.0%–12.6% | 0.092 |
| 2nd trimester | | 19,48,53 | 4.9% | 3.1%–7.4% | 0.752 |
| 3rd trimester | | 40,46,47 | 3.1% | 1.1%–8.1% | 0.038 |
| 1 week PP | | 40 | 0.0% | 0.0%–3.2% | |
| 1 month PP | | 40,42,47,50 | 3.8% | 2.2%–6.4% | 0.204 |
| 2 months PP | | 31,35,43,48,49,53 | 5.7% | 3.8%–8.7% | 0.001 |
| 3 months PP | | 41,42,47,50,52 | 4.7% | 3.6%–6.1% | 0.658 |
| 4 months PP | | 47 | 2.4% | 1.0%–4.9% | |
| 5 months PP | | 47 | 2.1% | 0.8%–4.4% | |
| 6 months PP | | 20,52 | 5.6% | 2.4%–12.1% | 0.028 |
| 7 months PP | | 47 | 3.1% | 1.4%–5.8% | |
| 8 months PP | | 47 | 1.0% | 0.2%–3.0% | |
| 12 months PP | | 52 | 3.9% | 2.3%–6.1% | |
| Period Prevalence | |||||
| Conception | Birth | 39 | 12.7% | 7.1%–20.4% | |
| 1st trimester | Birth | 48 | 9.4% | 4.9%–15.8% | |
| Birth | 1 month PP | 50 | 5.7% | 1.9%–12.8% | |
| Birth | 2 months PP | 19 | 8.1% | 3.6%–15.3% | |
| Birth | 3 months PP | 50,51 | 7.1% | 4.1%–11.7% | 0.626 |
| Birth | 5 months PP | 34 | 12.6% | 6.9%–20.6% | |
| Birth | 6 months PP | 20 | 6.5% | 3.7%–10.4% | |
| Birth | 8 months PP | 47 | 6.8% | 4.2%–10.4% | |
| Birth | 12 months PP | 48 | 21.9% | 15.1%–30.0% | |
| Incidence | |||||
| Conception | Birth | 30,39,48 | 7.5% | 3.8%–14.2% | 0.116 |
| Birth | 1 month PP | 42,50 | 5.2% | 3.1%–8.9% | 0.819 |
| Birth | 2 months PP | 48 | 8.1% | 4.0%–14.4% | |
| Birth | 3 months PP | 42,50,51 | 6.5% | 4.2%–9.6% | 0.767 |
| Birth | 12 months PP | 30 | 30.6% | 18.3%–45.4% | |
NOTE: Best estimates reflect the single or combined estimate at each point or period of time remaining after estimates with obvious, identifiable biases have been dropped.
PP, postpartum.
Point Prevalence. We show the best estimates for the point prevalence of major and minor depression graphically in Figure 9.
The best estimates for the point prevalence of major depression alone (Figure 10
However, all estimates have wide 95% CIs. Moreover, as shown in Figure 11
Incidence. We also found few estimates of the incidence of depression—the percentage of women with depressive episodes that begin during pregnancy or the first year postpartum. The studies we found suggest that as many as 14.5 percent of pregnant women have a new episode of major or minor depression during pregnancy, and 14.5 percent have a new episode during the first 3 months postpartum (Table 8). Considering major depression alone, 7.5 percent of women may have a new episode during pregnancy and 6.5 percent during the first 3 months after delivery (Table 9). Figure 12
The results of the cumulative meta-analysis are graphed in Figure 13.
| Explanatory Variables | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 |
|---|---|---|---|---|---|---|---|
| Constant | -2.291 (0.149) | -2.159 (0.174) | -2.564 (0.339) | -2.189 (0.218) | -2.273 (0.098) | -2.276 (0.151) | -2.125 (0.626) |
| | **P = 0.000** | **P = 0.000** | **P = 0.000** | **P = 0.000** | **P = 0.000** | **P = 0.000** | **P = 0.001** |
| 1st trimester vs. 4 to 12 months PP | 0.065 (0.322) | 0.068 (0.316) | 0.064 (0.337) | 0.032 (0.310) | 0.064 (0.238) | 0.056 (0.324) | 0.042 (0.336) |
| | P = 0.840 | P = 0.830 | P = 0.850 | P = 0.917 | P = 0.788 | P = 0.863 | P = 0.901 |
| 2nd trimester vs. 4 to 12 months PP | 0.080 (0.244) | 0.029 (0.242) | 0.182 (0.274) | -0.004 (0.260) | 0.012 (0.166) | 0.091 (0.246) | 0.065 (0.252) |
| | P = 0.744 | P = 0.903 | P = 0.508 | P = 0.989 | P = 0.943 | P = 0.711 | P = 0.798 |
| 3rd trimester vs. 4 to 12 months PP | -0.014 (0.229) | -0.011 (0.224) | -0.009 (0.235) | -0.065 (0.226) | -0.075 (0.156) | -0.007 (0.230) | -0.028 (0.237) |
| | P = 0.953 | P = 0.960 | P = 0.971 | P = 0.775 | P = 0.630 | P = 0.976 | P = 0.904 |
| 1 month PP vs. 4 to 12 months PP | -0.115 (0.222) | -0.033 (0.226) | -0.109 (0.242) | -0.029 (0.237) | 0.147 (0.160) | -0.054 (0.240) | -0.120 (0.226) |
| | P = 0.606 | P = 0.883 | P = 0.652 | P = 0.902 | P = 0.357 | P = 0.822 | P = 0.594 |
| 2 months PP vs. 4 to 12 months PP | 0.336 (0.211) | 0.426 (0.216) | 0.379 (0.223) | 0.404 (0.226) | 0.377 (0.167) | 0.361 (0.214) | 0.323 (0.219) |
| | P = 0.110 | **P = 0.049** | P = 0.089 | P = 0.073 | **P = 0.024** | P = 0.092 | P = 0.139 |
| 3 months PP vs. 4 to 12 months PP | 0.346 (0.255) | 0.400 (0.252) | 0.339 (0.273) | 0.425 (0.245) | 0.377 (0.175) | 0.354 (0.256) | 0.342 (0.258) |
| | P = 0.175 | P = 0.113 | P = 0.214 | P = 0.082 | **P = 0.031** | P = 0.167 | P = 0.185 |
| Low risk | -1.436 (0.271) | -1.494 (0.271) | -1.195 (0.389) | -1.529 (0.269) | -1.230 (0.195) | -1.474 (0.277) | -1.457 (0.278) |
| | **P = 0.000** | **P = 0.000** | **P = 0.002** | **P = 0.000** | **P = 0.000** | **P = 0.000** | **P = 0.000** |
| Low SES | 0.753 (0.204) | 0.818 (0.204) | 0.988 (0.331) | 0.772 (0.192) | 1.083 (0.149) | 0.737 (0.206) | 0.774 (0.219) |
| | **P = 0.000** | **P = 0.000** | **P = 0.003** | **P = 0.000** | **P = 0.000** | **P = 0.000** | **P = 0.000** |
| Publication year | — | -0.018 (0.013) | — | — | — | — | — |
| | | P = 0.170 | | | | | |
| Other western countries vs. US | — | — | 0.273 (0.305) | — | — | — | — |
| | | | P = 0.371 | | | | |
| Asian countries vs. US | — | — | 0.285 (0.349) | — | — | — | — |
| | | | P = 0.414 | | | | |
| SCID vs. SADS | — | — | — | -0.489 (0.202) | — | — | — |
| | | | | **P = 0.015** | | | |
| Other interview type vs. SADS | — | — | — | -0.098 (0.174) | — | — | — |
| | | | | P = 0.574 | | | |
| DSM III-R vs. RDC | — | — | — | — | -0.113 (0.188) | — | — |
| | | | | | P = 0.548 | | |
| DSM IV vs. RDC | — | — | — | — | -0.381 (0.190) | — | — |
| | | | | | **P = 0.045** | | |
| Other diagnostic criteria vs. RDC | — | — | — | — | -1.487 (0.268) | — | — |
| | | | | | **P = 0.000** | | |
| Interviewed women with positive screens only vs. all | — | — | — | — | — | -0.096 (0.140) | — |
| | | | | | | P = 0.490 | |
| Quality score | — | — | — | — | — | — | -0.014 (0.051) |
| | | | | | | | P = 0.783 |
Notes: Estimated coefficients are shown along with their standard errors in parentheses and the P-value for a test of statistically significant differences from zero. P-values shown in bold type are significant at the < 0.05 level.
DSM III-R, Diagnostic and Statistical Manual of Mental Disorders, Third Edition, Revised; DSM IV, Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition; PP, postpartum; RDC, Research Diagnostic Criteria; SADS, Schedule for Affective Disorders and Schizophrenia; SCID, Structured Clinical Interview for DSM-IV; SES, socioeconomic status.
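The tabulated P-values can be checked against the coefficients and standard errors. The following is a minimal sketch that assumes a two-sided Wald test with a normal approximation; the report does not state which test statistic was used, and the function name `wald_p_value` is introduced here only for illustration. Under that assumption, the two coefficients below, taken from the table above, give P-values close to the tabulated 0.015 and 0.049.

```python
import math


def wald_p_value(coefficient: float, standard_error: float) -> float:
    """Two-sided P-value for H0: coefficient = 0 (normal approximation; an assumption)."""
    z = coefficient / standard_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), with Phi the standard normal CDF.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Coefficient and standard error pairs copied from the table above.
p_scid = wald_p_value(-0.489, 0.202)   # Model 4, SCID vs. SADS
p_2_months = wald_p_value(0.426, 0.216)  # Model 2, 2 months PP vs. 4 to 12 months PP

print(f"SCID vs. SADS: P = {p_scid:.3f}")
print(f"2 months PP (Model 2): P = {p_2_months:.3f}")
```

Small differences in the third decimal place are expected because the tabulated coefficients and standard errors are themselves rounded.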
| Explanatory Variables | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 |
|---|---|---|---|---|---|---|---|
| Constant | -3.206 (0.209) | -3.299 (0.272) | -3.447 (0.515) | -3.010 (0.373) | -3.419 (0.191) | -3.454 (0.239) | -1.677 (0.871) |
| | **P = 0.000** | **P = 0.000** | **P = 0.000** | **P = 0.000** | **P = 0.000** | **P = 0.000** | P = 0.054 |
| 1st trimester vs. 4 to 12 mos PP | 0.052 (0.516) | 0.033 (0.523) | -0.180 (0.558) | -0.086 (0.558) | 0.271 (0.447) | 0.278 (0.512) | -0.085 (0.514) |
| | P = 0.920 | P = 0.950 | P = 0.747 | P = 0.877 | P = 0.545 | P = 0.587 | P = 0.868 |
| 2nd trimester vs. 4 to 12 mos PP | 0.375 (0.384) | 0.424 (0.399) | 0.471 (0.466) | 0.245 (0.444) | 0.517 (0.315) | 0.634 (0.392) | 0.339 (0.376) |
| | P = 0.329 | P = 0.288 | P = 0.312 | P = 0.582 | P = 0.101 | P = 0.105 | P = 0.368 |
| 3rd trimester vs. 4 to 12 mos PP | -0.272 (0.410) | -0.277 (0.415) | -0.372 (0.426) | -0.351 (0.436) | -0.052 (0.354) | -0.012 (0.417) | -0.379 (0.406) |
| | P = 0.507 | P = 0.503 | P = 0.382 | P = 0.421 | P = 0.883 | P = 0.976 | P = 0.350 |
| 1 month PP vs. 4 to 12 mos PP | 0.021 (0.356) | -0.033 (0.377) | -0.185 (0.409) | -0.168 (0.415) | -0.073 (0.290) | 0.060 (0.342) | 0.035 (0.349) |
| | P = 0.954 | P = 0.929 | P = 0.651 | P = 0.686 | P = 0.800 | P = 0.861 | P = 0.920 |
| 2 mos PP vs. 4 to 12 mos PP | 0.557 (0.291) | 0.533 (0.300) | 0.538 (0.304) | 0.377 (0.356) | 0.573 (0.257) | 0.613 (0.279) | 0.465 (0.289) |
| | P = 0.056 | P = 0.076 | P = 0.077 | P = 0.290 | **P = 0.026** | **P = 0.028** | P = 0.107 |
| 3 mos PP vs. 4 to 12 mos PP | 0.231 (0.342) | 0.228 (0.347) | 0.097 (0.364) | 0.139 (0.364) | 0.159 (0.273) | 0.279 (0.328) | 0.180 (0.336) |
| | P = 0.499 | P = 0.510 | P = 0.791 | P = 0.703 | P = 0.561 | P = 0.395 | P = 0.592 |
| Low risk | -1.501 (0.671) | -1.436 (0.687) | -1.045 (0.822) | -1.537 (0.695) | -1.340 (0.609) | -1.384 (0.658) | -1.915 (0.702) |
| | **P = 0.025** | **P = 0.036** | P = 0.204 | **P = 0.027** | **P = 0.028** | **P = 0.036** | **P = 0.006** |
| Low SES | 0.459 (0.323) | 0.428 (0.333) | 0.759 (0.497) | 0.379 (0.345) | 0.498 (0.262) | 0.432 (0.308) | 0.636 (0.331) |
| | P = 0.155 | P = 0.199 | P = 0.126 | P = 0.273 | P = 0.057 | P = 0.161 | P = 0.054 |
| Publication year | — | 0.010 (0.020) | — | — | — | — | — |
| | | P = 0.604 | | | | | |
| Other western countries vs. US | — | — | 0.238 (0.470) | — | — | — | — |
| | | | P = 0.612 | | | | |
| Asian countries vs. US | — | — | 0.602 (0.533) | — | — | — | — |
| | | | P = 0.258 | | | | |
| SCID vs. SADS | — | — | — | 0.112 (0.332) | — | — | — |
| | | | | P = 0.736 | | | |
| Other interview type vs. SADS | — | — | — | -0.201 (0.306) | — | — | — |
| | | | | P = 0.511 | | | |
| DSM III-R vs. RDC | — | — | — | — | 0.815 (0.241) | — | — |
| | | | | | **P = 0.001** | | |
| DSM IV vs. RDC | — | — | — | — | -0.198 (0.356) | — | — |
| | | | | | P = 0.578 | | |
| Other diagnostic criteria vs. RDC | — | — | — | — | 0.414 (0.218) | — | — |
| | | | | | P = 0.058 | | |
| Interviewed women with positive screens only vs. all | — | — | — | — | — | 0.441 (0.222) | — |
| | | | | | | **P = 0.047** | |
| Quality score | — | — | — | — | — | — | -0.132 (0.073) |
| | | | | | | | P = 0.072 |
Notes: P-values shown in bold type are significant at the < 0.05 level.
DSM III-R, Diagnostic and Statistical Manual of Mental Disorders, Third Edition, Revised; DSM IV, Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition; PP, postpartum; RDC, Research Diagnostic Criteria; SADS, Schedule for Affective Disorders and Schizophrenia; SCID, Structured Clinical Interview for DSM-IV; SES, socioeconomic status.
| Diagnosis / Estimate Type / Author, Year | Time Period | Odds Ratio | 95% Confidence Interval |
|---|---|---|---|
| Major and Minor Depression | |||
| Point | |||
| O'Hara et al., 199053 | 2nd trimester | 1.41 | 0.61–3.26 |
| O'Hara et al., 199053 | 9 weeks PP | 1.37 | 0.67–2.83 |
| Cox et al., 199320 | 6 months PP | 1.00 | 0.54–1.84 |
| Period | |||
| Cox et al., 199320 | Birth to 6 months PP | 1.04 | 0.61–1.76 |
| Incidence | |||
| Cox et al., 199320 | Birth to 5 weeks PP | 3.26* | 1.17–9.06 |
| Cox et al., 199320 | Birth to 6 months PP | 1.48 | 0.77–2.82 |
| Major Depression | |||
| Point | |||
| O'Hara et al., 199053 | 2nd trimester | 1.28 | 0.47–3.51 |
| O'Hara et al., 199053 | 9 weeks PP | 1.33 | 0.45–3.90 |
| Cooper et al., 198852 | 3 months PP | 0.85 | 0.33–2.17 |
| Cox et al., 199320 | 6 months PP | 1.00 | 0.37–2.71 |
| Cooper et al., 199633 | 6 months PP | 1.53 | 0.65–3.58 |
| Cooper et al., 199633 | 12 months PP | 0.50 | 0.17–1.46 |
| Period | |||
| Cox et al., 199320 | 6 months PP | 1.16 | 0.54–2.51 |
*Statistically significant at P < 0.05.
PP, postpartum.
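An odds ratio is flagged as statistically significant at the 0.05 level when its 95% CI excludes 1. The sketch below illustrates that reading for two rows of the table above. It assumes the intervals are Wald intervals that are symmetric on the log-odds scale, which the report does not state; the helper name `log_scale_summary` and the derived standard errors and P-values are illustrative and do not appear in the source.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def log_scale_summary(odds_ratio: float, ci_low: float, ci_high: float) -> dict:
    """Back out the log-scale SE and a two-sided Wald P-value from an OR and its 95% CI."""
    se_log_or = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z_975)
    z = math.log(odds_ratio) / se_log_or
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    # For an interval symmetric on the log scale, the OR is the geometric mean of the bounds.
    midpoint = math.sqrt(ci_low * ci_high)
    return {"se_log_or": se_log_or, "z": z, "p": p_value, "midpoint": midpoint}


# Rows copied from the table above (major and minor depression, Cox et al.).
rows = [
    ("Incidence, birth to 5 weeks PP", 3.26, 1.17, 9.06),
    ("Point prevalence, 6 months PP", 1.00, 0.54, 1.84),
]

for label, odds_ratio, low, high in rows:
    summary = log_scale_summary(odds_ratio, low, high)
    excludes_one = low > 1.0 or high < 1.0
    print(f"{label}: CI excludes 1 = {excludes_one}, approximate P = {summary['p']:.3f}")
```

Only the first row has an interval that excludes 1, which matches the single asterisk in the table.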
The prevalence estimates from the retrospective studies measure something different from those of the prospective studies. In the prospective studies, all study women recruited from prenatal clinics or maternity wards were screened and interviewed for depression. Thus, all (or nearly all) women with depression in the populations so defined were identified. In the retrospective studies, only those women whose depression was detected in the course of medical contacts during the year were identified.
In 1997-1998, universal screening for depression with the EPDS at the 6-week postpartum visit was implemented in Olmsted County. As a result, the prevalence of a diagnosis of major depression among postpartum women rose to 10.7 percent, suggesting that the screening score posted in medical charts led clinicians to become more aware of their patients' mental state.55
We found 30 studies providing estimates of the prevalence of perinatal depression but only 13 providing estimates of the incidence of the disorder. The studies were generally of moderate size—too small for reliable subgroup analyses. Furthermore, the study populations were typically restricted to a local community or geographic region served by one provider or a small number of providers of obstetrical services and were not representative of the racial and ethnic mix of the countries in which the studies were conducted. Other confounders included the risk status of women at study entry, their socioeconomic status, the interview methods, and the diagnostic criteria used to identify cases.
Combining point prevalence estimates of depression assessed at the same point in time and distinguishing whether they included minor depression, we found that the best estimates of the point prevalence of major and minor depression ranged from 8.5 percent to 11.0 percent at different times during pregnancy and from 6.5 percent to 12.9 percent at different times during the first year postpartum. Including only major depression, the best point prevalence estimates ranged from 3.1 percent to 4.9 percent at different times during pregnancy and from 1.0 percent to 5.9 percent at different times during the first postpartum year.
Period prevalence estimates show that as many as 19.2 percent of women have a depressive episode during the first 3 months postpartum, with as many as 7.1 percent having a major depressive episode during this time. Most of these episodes began following delivery. Incidence estimates show that, during the same 3-month period, 14.5 percent of women had a new depressive episode with as many as 6.5 percent having a major depressive episode. However, all of these estimates have wide 95% CIs, indicating that the amount of uncertainty in their precise values is considerable.
Our best estimates of prevalence and incidence were somewhat lower than those found in prior systematic reviews because we excluded studies that assessed depression based on self-report screens alone, which tend to overestimate prevalence. In addition, we separated estimates of major and minor depression from estimates of major depression alone, and estimates of point prevalence from estimates of period prevalence. Finally, we included more recent studies that used more precise criteria to identify major depression.
We found that the available evidence does not support the hypothesis that the prevalence of depression is higher during pregnancy or in the first year postpartum compared to nonchildbearing times. A single study suggested that the incidence of new depressive episodes (major and minor) is greater in the first 5 weeks postpartum than at other times.20
Nevertheless, pregnancy and the early postpartum period provide opportunities to screen for depression through regular prenatal and postpartum physician contacts. Because the consequences of depression during the perinatal period can be far-reaching, affecting not only the woman but also her newborn child and other family members, it behooves us to investigate the efficacy of screening and treatment programs for these women.
Screening for perinatal depression is an important first step in identifying women who are at risk of having perinatal depression. It is only an initial step—after a positive screen, a depressive illness must be confirmed by a follow-up diagnostic examination and determination by a clinician.
To be useful screening tools, instruments must be able to identify accurately and reliably the illness in the population of interest; they also need to rule out, accurately, persons in the population who do not have the illness. Assessment of a screening test's accuracy depends on knowing whether a disease is truly present, i.e., comparison to a reference standard. This section addresses the second Key Question (KQ) from the Safe Motherhood Group (SMG) and the Agency for Healthcare Research and Quality (AHRQ): “What is the accuracy of different screening tools for detecting depression during pregnancy and during the postpartum period?”
The two most commonly used measures of accuracy are sensitivity and specificity. Sensitivity refers to the proportion of patients with a disease who test positive (“true positives”) using a screening tool. A sensitive test is one that is usually positive in the presence of disease. In general, a highly sensitive test should be selected when the consequence of missing a disease would be a clearly bad outcome. Screens with high sensitivity are most useful to clinicians when the result is negative; negative results can help rule out a disease.
Specificity refers to the proportion of patients without a disease who test negative (“true negatives”) using the screening tool. A specific test is one that is usually negative in the absence of disease. A highly specific test, then, should be selected when false-positive results can substantially harm the patient in some way. Screens with high specificity are most useful to the clinician when the result is positive; the positive result can rule in the disease.
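As a worked illustration of these two definitions, the sketch below computes sensitivity and specificity from the four cells of a screen-by-reference-standard table. The counts are hypothetical and are not drawn from any study in this report; the class and function names are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenResults:
    """Cross-classification of a screen against a reference standard (hypothetical counts)."""
    true_pos: int   # depressed by reference standard, screen positive
    false_neg: int  # depressed by reference standard, screen negative
    true_neg: int   # not depressed, screen negative
    false_pos: int  # not depressed, screen positive


def sensitivity(r: ScreenResults) -> float:
    """Proportion of women with the disease who screen positive."""
    return r.true_pos / (r.true_pos + r.false_neg)


def specificity(r: ScreenResults) -> float:
    """Proportion of women without the disease who screen negative."""
    return r.true_neg / (r.true_neg + r.false_pos)


# Hypothetical sample: 20 women with depression, 180 without.
example = ScreenResults(true_pos=18, false_neg=2, true_neg=160, false_pos=20)
print(f"sensitivity = {sensitivity(example):.3f}")
print(f"specificity = {specificity(example):.3f}")
```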
Screening tools have varying sensitivities and specificities as a function of which cutoff point, or threshold, clinicians and others use. The optimal cutoff depends on prevalence of disease (as explored in Chapter 3), benefits and harm of therapy, and risks and costs of administering the screening test.
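One way to see why prevalence matters is to hold sensitivity and specificity fixed and compute the predictive values at different prevalences using Bayes' theorem. The sketch below does this; the sensitivity (0.85), specificity (0.90), and the two prevalences are assumed values chosen for illustration only, not estimates from this report.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (positive predictive value, negative predictive value) via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumed screen characteristics and prevalences (illustrative only).
SENS, SPEC = 0.85, 0.90
for prevalence in (0.05, 0.13):
    ppv, npv = predictive_values(SENS, SPEC, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.2f}, NPV = {npv:.3f}")
```

With the screen's accuracy unchanged, the share of positive screens that are true cases rises as prevalence rises, so the balance between the cost of false positives and the cost of missed cases shifts with the population being screened.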
Chapter 2 provides the detailed methods we used to search and review the literature on screening instrument accuracy. In this discussion, we elaborate on some of these methods.
To be retained, studies had to report our primary outcomes of interest, sensitivity and specificity, directly or provide data allowing us to calculate them. We required that the screening instrument be compared to a reference standard for a diagnosis of depression. Reference standards could be one of two types. The first is a clinical assessment by a mental health professional based on criteria from the Diagnostic and Statistical Manual of Mental Disorders (DSM), the Research Diagnostic Criteria (RDC), the Bedford College Checklist,69 or the International Classification of Diseases (ICD). The second is a research-based diagnosis obtained by structured or semistructured clinical interview, such as the Structured Clinical Interview for DSM (SCID), the Diagnostic Interview Schedule (DIS), the Schedule for Affective Disorders and Schizophrenia (SADS), or Goldberg's Standardized Psychiatric Interview (SPI); each of these confirms a diagnosis based on one of the above systems of criteria.
Depressive illness can be either a major depressive disorder or a minor depression. The latter is understood to be an impairing, episodic depression with clear symptoms exceeding a normal state but without severity reaching the diagnostic criteria for major depressive disorder. For this chapter, we are concerned with the ability of screening tools to detect either major depression or minor depression in a given individual (because an individual can have only one or the other of these diagnoses), so the terminology intentionally differs from that used in Chapter 3.
We excluded studies that included patients with a known current depressive illness (for whom a screen would not provide new information). Furthermore, we excluded studies on women with bipolar disorder or a primary psychotic disorder and studies in which women with diagnosed depression could not be distinguished from women with maternity blues, a transient, subthreshold cluster of depressive symptoms commonly described in up to 50 percent of postpartum women.
Our main outcomes of interest were sensitivity and specificity of the screening approaches or instruments as described in the selected articles. When calculating outcomes ourselves or doing other analyses, we used Stata, version 8. For each reported instrument and associated cutoff, we calculated sensitivity and specificity from the published data. We constructed 95% confidence intervals (CIs) using exact methods. For instruments with three or more outcome values reported, we created plots of the sensitivity or specificity with associated 95% CIs to provide a graphic description of the degree of consistency of results. In addition, where possible we estimated pooled sensitivity and specificity values using meta-analytic methods for fixed effects. We evaluated heterogeneity using the Q statistic test for homogeneity. In several circumstances, pooled estimates were impossible to calculate because of perfect estimates of sensitivity (i.e., 100 percent) with associated variance estimates equal to 0.
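The calculations described above can be sketched in a few lines. The analyses in this report were run in Stata; the Python below is a stand-in for illustration, not the code actually used. It assumes that "exact methods" refers to Clopper-Pearson binomial intervals and that the fixed-effects pooling used inverse-variance weights on the untransformed proportion scale, neither of which the text specifies. All function names and counts are hypothetical.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); adequate for modest n."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _root(g, iters: int = 100) -> float:
    """Bisection on [0, 1] for a monotone function g that changes sign."""
    lo, hi = 0.0, 1.0
    sign_lo = g(lo) > 0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if (g(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_ci(x: int, n: int, alpha: float = 0.05):
    """Clopper-Pearson interval for x successes out of n (assumed meaning of 'exact')."""
    lower = 0.0 if x == 0 else _root(lambda p: (1.0 - binom_cdf(x - 1, n, p)) - alpha / 2)
    upper = 1.0 if x == n else _root(lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper


def pool_fixed(counts):
    """Inverse-variance fixed-effect pooled proportion, Cochran's Q, and its degrees of freedom.

    counts: list of (events, total) pairs, e.g. true positives out of all women with the disease.
    """
    props = [x / n for x, n in counts]
    variances = [p * (1.0 - p) / n for p, (_, n) in zip(props, counts)]
    if any(v == 0.0 for v in variances):
        # A proportion of exactly 0 or 1 has zero estimated variance, so its weight is undefined.
        raise ValueError("cannot pool: a study has an estimated variance of 0")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * p for w, p in zip(weights, props)) / sum(weights)
    q = sum(w * (p - pooled) ** 2 for w, p in zip(weights, props))
    return pooled, q, len(counts) - 1


# Hypothetical sensitivity data: (screen-positive cases, all cases) per study.
studies = [(18, 20), (25, 30), (40, 52)]
for x, n in studies:
    low, high = exact_ci(x, n)
    print(f"{x}/{n}: sensitivity = {x / n:.3f}, 95% CI {low:.3f} to {high:.3f}")
pooled, q, df = pool_fixed(studies)
print(f"pooled sensitivity = {pooled:.3f}, Q = {q:.2f} on {df} df")
```

Under these assumptions, Q would be referred to a chi-square distribution with k - 1 degrees of freedom for k studies. The `ValueError` branch mirrors the situation noted above, in which a study reporting 100 percent sensitivity has an estimated variance of 0 and cannot be weighted.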
We developed a quality rating form for these articles on screening accuracy from criteria identified by the Cochrane Methods Working Group on Systematic Review of Screening and Diagnostic Tests.25 The quality rating forms, provided in Appendix B, rated reporting, external validity, and internal validity. The senior abstractor completed the quality rating form for each article; another project team member reviewed a sample of the completed forms for accuracy and completeness.
We rated retained studies on three separate categories of quality and then summed the individual category scores for a total score. The domains and maximum points possible for each domain are as follows:
Reporting (domain score of 10): Nine items covering study aims, description of depression assessment, potential confounders described, and instrument procedures described, each scored yes or no (1 or 0), except for an item concerning principal confounders that was scored yes, partially, or no (2, 1, or 0). We considered 0 to 3 as poor, 4 to 7 as fair, and 8 to 10 as good.
External validity (domain score of 3): Three items relating to representativeness of populations from which people were recruited and of settings and clinicians that treat such patients, each scored yes or no (1 or 0). We considered 0 or 1 as poor, 2 as fair, and 3 as good.
Internal validity (domain score of 8): Six items relating to both bias and confounding in the use of the screen and reference standard, each scored yes or no (1 or 0), except for an item assessing whether all screens were done independently on each person, all tests done on each person but not independently, or different tests done on different persons and not randomly allocated (2, 1, or 0, respectively). We considered 0 to 2 as poor, 3 to 5 as fair, and 6 to 8 as good.
The maximum total quality score was 21. We considered 0 to 7 as poor, 8 to 14 as fair, and 15 to 21 as good.
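The banding rules above can be expressed compactly. The following sketch encodes the poor, fair, and good cut points exactly as stated for each domain and for the total; the dictionary keys, function names, and the example scores are hypothetical and are not taken from the rating forms in Appendix B.

```python
# (maximum score, highest "poor" score, highest "fair" score) for each domain, as stated above.
DOMAIN_BANDS = {
    "reporting": (10, 3, 7),
    "external_validity": (3, 1, 2),
    "internal_validity": (8, 2, 5),
}
TOTAL_BANDS = (21, 7, 14)


def rate(score: int, bands: tuple) -> str:
    """Map a score to 'poor', 'fair', or 'good' using the given cut points."""
    maximum, poor_max, fair_max = bands
    if not 0 <= score <= maximum:
        raise ValueError(f"score {score} is outside 0..{maximum}")
    if score <= poor_max:
        return "poor"
    if score <= fair_max:
        return "fair"
    return "good"


def rate_study(scores: dict):
    """Return the total score and the rating for each domain and for the total."""
    ratings = {domain: rate(scores[domain], bands) for domain, bands in DOMAIN_BANDS.items()}
    total = sum(scores[domain] for domain in DOMAIN_BANDS)
    ratings["total"] = rate(total, TOTAL_BANDS)
    return total, ratings


# Hypothetical study scores for illustration.
total, ratings = rate_study({"reporting": 8, "external_validity": 1, "internal_validity": 5})
print(total, ratings)
```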
Our literature review of screening tools for detecting depression during pregnancy and the postpartum period identified no relevant systematic reviews. We did find 23 studies meeting our inclusion criteria. Of these, 10 were studies involving screening instruments in English;32, 46, 70–77 13 involved non-English screening instruments.31, 35, 42, 43, 50, 51, 78–84
| Author, Year | Place/Sample Size | Depression Type and Prevalence | Screening Method(s) and Cutoffs Used | Timing of Screenings | Criterion Standard |
|---|---|---|---|---|---|
| Prenatal Period | |||||
| Murray and Cox, 199046 | UK 100 | Major depression: 6%; major or minor depression: 14% | EPDS: cutoffs vary from ≥ 11 to ≥ 15 | 28 to 34 weeks GA | SPI to obtain RDC diagnosis |
| Postpartum Period | |||||
| Ballard et al., 199470 | UK 200 | Major depression alone: 12% | EPDS: cutoff 13 | 6 months PP | PAS to obtain RDC diagnosis |
| Beck and Gable, 200171 | US 150 | Major depression alone: 12%; major or minor depression: 19% | PDSS ≥ 81; EPDS ≥ 13; BDI-II ≥ 21 | Between 2nd and 12th week PP | SCID-DSM-IV for DSM-IV diagnosis |
| Boyce et al., 199372 | Australia 103 | Major depression alone: 9% | EPDS ≥ 13; GHQ: NR; Pitt Scales: NR | ≤6 months PP | DIS to obtain DSM-III-R diagnosis |
| Campbell and Cohn, 199132 | US 1,007 | Major or minor depression: 9% | CES-D | 6 to 8 weeks after delivery | Modified SADS to obtain RDC diagnosis |
| Cox et al., 199673 | UK 128 | Major depression alone: 6%; major or minor depression: 16% | EPDS ≥ 13 (primarily) but also ≥ 10, ≥ 11, ≥ 12, ≥ 14, ≥ 15 | Not reported in relationship to time of birth | SPI to obtain RDC diagnosis |
| Harris et al., 198974 | Wales 147 | Major depression alone: 15% | BDI: ≥ 11; EPDS: ≥ 13 | 6 to 8 weeks PP | Clinical examination for DSM-III criterion |
| Leverton and Elliott, 200075 | England 199 | Major or minor depression (Catego: 5%; Bedford: 8%) | EPDS ≥ 13 | 3 months PP | PSE with 2 standards used: Bedford College and Catego diagnosis |
| Murray and Carothers, 199076 | England 646 | Not provided, but data suggest major depression alone: 6%; major or minor depression: 15% | EPDS ≥ 13 | 6 weeks PP | SPI to obtain RDC diagnosis |
| Whiffen, 198877 | Canada 120 | Major or minor depression: 18% | BDI ≥ 10 | 6 to 8 weeks PP | SADS to obtain RDC diagnosis |
BDI, Beck Depression Inventory; CES-D, Center for Epidemiological Studies-Depression scale; DIS, Diagnostic Interview Schedule; DSM-III-R, Diagnostic and Statistical Manual of Mental Disorders, third edition, revised; DSM-IV, Diagnostic and Statistical Manual of Mental Disorders, fourth edition; EPDS, Edinburgh Postnatal Depression Scale; GA, gestational age; GHQ, General Health Questionnaire; PAS, Psychiatric Assessment Schedule; PDSS, Postpartum Depression Screening Scale; PP, postpartum; PSE, Present State Examination; RDC, Research Diagnostic Criteria; SADS, Schedule for Affective Disorders and Schizophrenia; SCID, Structured Clinical Interview for DSM-IV; SPI, Standardized Psychiatric Interview.
Unfortunately, the racial and ethnic mix of the study populations for the studies using English language screening instruments was poorly representative of the US population (our target of interest). Of the 10 studies, only the two studies conducted in the United States reported race and ethnicity.32, 71 These populations were overwhelmingly Caucasian; in by far the largest study,32 100 percent of the 1,007 women enrolled were white, and, in the other, 87 percent of the women were white.71
When reported, the mean age of women in these studies ranged from approximately 24 to 31 years. Of these 10 studies, only one was conducted during pregnancy.46 The remaining nine studies were conducted postpartum between 2 weeks and 6 months after delivery, with most occurring between weeks 8 and 12. Individual study sizes ranged from 100 to 1,007, with an aggregate sample size of 2,800.
Studies might use one or more screening tools; the selected articles evaluated four different screening tools.
| Screening Tool | Method of Administration | Number of Items | Score Ranges | Time to Complete | Time Frame Covered |
|---|---|---|---|---|---|
| EPDS | Self-administered | 10-item* | 0–30 | < 5 minutes | In the past 7 days |
| 13-item | 0–39 | ||||
| BDI† | Interviewer- or self-administered | 21-item | 0–63 | 5–10 minutes | Last week including today |
| BDI-II† | Interviewer- or self-administered | 21-item | 0–63 | 5–10 minutes | During the past 2 weeks |
| PDSS | Self-administered | 35-item | 35–175 | 5–10 minutes | Over the past 2 weeks |
| CES-D | Self-administered | 20-item | 0–60 | 5–10 minutes | Past 7 days |
*The 10-item EPDS is more commonly administered than the 13-item version.
†BDI and BDI-II were originally designed to be administered by an interviewer but are most often self-reported.
Three studies assessed the Beck Depression Inventory (BDI).71, 74, 77 The BDI is a list of 21 symptoms and attitudes that are each rated on intensity.86 Versions include the BDI, which uses “last week, including today” as the time frame for symptoms;86 the BDI-II, which uses 2 weeks as the time frame for symptoms;87 and the BDI-PC, which also has a 2-week time frame.88 The versions used most often (BDI or BDI-II) are scored by summing the ratings that respondents give to the 21 items. Although originally designed to be administered by trained interviewers, it is most often self-administered and takes 5 to 10 minutes to complete. This instrument has been used to measure severity of depression in depressed samples and also to assess depression in general population samples. Because of its reliance on somatic symptoms, some experts worry that it may produce higher scores and more false-positive results in pregnant women than in other respondents.
One study used the Postpartum Depression Screening Scale (PDSS).71 The PDSS is a 35-item Likert-type self-report instrument created specifically for new mothers that can be administered in 5 to 10 minutes. Written at a third-grade reading level, PDSS items are brief and easy to understand. Mothers respond using a 5-point scale ranging from “strongly disagree” to “strongly agree.” The test yields an overall severity score falling into one of three ranges: normal adjustment, significant symptoms of postpartum depression, and positive screen for major postpartum depression. The PDSS also provides scores for seven symptom areas: Sleeping/Eating Disturbances, Anxiety/Insecurity, Emotional Lability, Mental Confusion, Loss of Self, Guilt/Shame, and Suicidal Thoughts.
Another study used the Center for Epidemiological Studies Depression Scale (CES-D).32 The CES-D was designed to measure current level of depressive symptomatology and especially depressive affect.89 The 20 items were chosen from five previously used depression scales to represent all major components of depressive symptomatology, and it was designed to apply to a general population. Each item is rated on a 4-point scale indicating the degree of its occurrence during the past week. The scale ranges from “rarely or none of the time” to “most or all of the time.” The instrument can distinguish between clinical groups and general community groups. It takes approximately 5 to 10 minutes to complete; scoring takes about 1 to 2 minutes. Although it is usually scored continuously, various cutoff scores for clinical depression have reasonable associations with a clinical diagnosis. A cutoff score of 16 or higher has been suggested as a positive screen for depression.89
Investigators used a variety of strategies to confirm the diagnosis of depression. Six studies used the RDC65 for depressive illness as the reference standard but employed different instruments to identify patients meeting this standard. Three studies used the Standardized Psychiatric Interview,46, 73, 76 two studies used a version of the Schedule for Affective Disorders and Schizophrenia,32, 77 and one study used the Psychiatric Assessment Schedule (an adaptation of the Present State Examination).70
Other reference standards were also employed. Beck and Gable used the Structured Clinical Interview for DSM-IV to confirm the diagnosis of depressive illness per DSM-IV criteria;71 Boyce et al. used the Diagnostic Interview Schedule, based on DSM-III-R criteria, as the reference standard to confirm depressive illness;72 Harris et al. used a clinical assessment of whether a patient's presentation met DSM-III criteria for depressive illness;74 and Leverton and Elliott used the Present State Examination to identify whether patients met depressive illness criteria by either the Bedford College Criteria or the Catego criteria (based on ICD-8 criteria).75
Investigators classified depressive illness into one of two categories that reflected how perinatal depression is described in the scientific literature: major depression alone or major or minor depression. Patients identified as major depression alone met criteria for an episode of severe depressive illness according to the standardized criteria. In this report, we refer to major depressive episodes as major depression. For major depressive disorders, clearly effective interventions have been identified in clinical trials. Seven studies provided this classification.46, 70–74, 76
The point prevalence for major depression alone was 6 percent in the single prenatal study,46 somewhat higher than the 3.1 percent “best estimate” that we discussed in Chapter 3. For the postpartum studies, the point prevalence for the six studies reporting on major depression alone ranged from 6 percent to 15.5 percent;70–74, 76 this frequency is somewhat higher than the postpartum results from KQ 1 showing a best estimate prevalence between 1 and 3 months postpartum of 3.8 percent and 4.7 percent, respectively.
The major or minor depression category of depressive illness requires that patients meet diagnostic criteria for either a major depressive episode or a minor depressive episode. Minor depression is an impairing yet less severe constellation of depressive symptoms13 for which controlled trials have not consistently indicated that particular interventions are more effective than placebo.14, 15 In this report, we refer to this grouping as major or minor depression, or by the more general terms of “depression” or “depressive illness.” Seven studies classified depression in this way.32, 46, 71, 73, 75–77
In the single prenatal screening study, the point prevalence of major or minor depression in the third trimester (14 percent) was greater than our best estimate from KQ 1 for this time period (8.5 percent).46 For the postpartum studies, prevalence rates ranged from 5 percent to 19 percent; these figures are somewhat higher than our best estimate range for point prevalence of 9.7 percent to 12.9 percent in the first 3 months postpartum. Given that this distinction substantially affects screening accuracy at a particular cutoff, we sort the results below by these two case definitions.
| Author, Year | Reporting (10) | External Validity (3) | Internal Validity (8) | Total Score (21) |
|---|---|---|---|---|
| Studies with Screener in English | ||||
| Prenatal Period | ||||
| Murray and Cox, 199046 | 5 | 3 | 8 | 16 |
| Postpartum Period | ||||
| Ballard et al., 199470 | 9 | 1 | 8 | 18 |
| Beck and Gable, 200171 | 6 | 1 | 8 | 15 |
| Boyce et al., 199372 | 8 | 3 | 5 | 16 |
| Campbell and Cohn, 199132 | 8 | 3 | 8 | 19 |
| Cox et al., 199673 | 5 | 0 | 8 | 13 |
| Harris et al., 198974 | 5 | 2 | 6 | 13 |
| Leverton and Elliott, 200075 | 5 | 0 | 7 | 12 |
| Murray and Carothers, 199076 | 4 | 3 | 8 | 15 |
| Whiffen, 198877 | 6 | 0 | 4 | 10 |
| Average | 6.1 | 1.6 | 7.0 | 14.7 |
Note: Maximum possible score is shown in parentheses.
In summary, one prenatal screening study is of good quality. However, the inclusion of only six women with major depression substantially limits conclusions about the accuracy of prenatal depression screens. Indeed, the sensitivity results at 100 percent for each cutoff dramatically underscore the small number of depressed patients involved.
Results for major or minor depression from this one study are similarly limited. Only 14 depression cases are involved. Sensitivity and specificity estimates appeared to be lower than those for major depression alone. In particular, sensitivity estimates appeared worse than those for major depression alone, but again CIs are wide.
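The width of these intervals follows directly from the small case counts. The Python sketch below computes an exact (Clopper-Pearson) binomial interval using only the standard library. It is an illustration, not the report's documented method: this section does not state how the CIs were derived, and the counts used (6 of 6 cases detected for major depression; 8 of 14 inferred from a sensitivity of 0.57 among 14 cases) are assumptions. All function names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _bisect(func, target, increasing):
    """Find p in [0, 1] where a monotone func(p) crosses target."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(x, n, alpha=0.05):
    """Two-sided exact (Clopper-Pearson) interval for x successes in n trials."""
    lower = 0.0 if x == 0 else _bisect(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if x == n else _bisect(
        lambda p: binom_cdf(x, n, p), alpha / 2, increasing=False)
    return lower, upper

# Assumed counts: every one of 6 major-depression cases screened positive.
low, high = exact_ci(6, 6)
print(f"6/6 detected: {low:.2f} to {high:.2f}")

# Assumed counts: 8 of 14 major-or-minor cases screened positive.
low, high = exact_ci(8, 14)
print(f"8/14 detected: {low:.2f} to {high:.2f}")
```

With only six cases, even perfect observed sensitivity leaves the lower bound near 0.54, which is why the point estimate of 100 percent carries little weight.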
| Author, Year | Cutoff (≥) | Point Estimate for Sensitivity 95% CI | Point Estimate for Specificity 95% CI |
|---|---|---|---|
| Prenatal period | |||
| EPDS, Major depression | |||
| Murray and Cox, 199046 | 15 | 1.0 | 0.96 |
| 0.54–1.0 | 0.89–0.99 | ||
| 14 | 1.0 | 0.94 | |
| 0.54–1.0 | 0.87–0.98 | ||
| 13 | 1.0 | 0.87 | |
| 0.54–1.0 | 0.79–0.93 | ||
| 12 | 1.0 | 0.79 | |
| 0.54–1.0 | 0.69–0.86 | ||
| EPDS, Major or minor depression | |||
| Murray and Cox, 199046 | 14 | 0.57 | 0.95 |
| 0.29–0.82 | 0.89–0.99 | ||
| 13 | 0.64 | 0.90 | |
| 0.35–0.87 | 0.81–0.95 | ||
| 12 | 0.64 | 0.80 | |
| 0.35–0.87 | 0.70–0.88 | ||
| 11 | 0.71 | 0.72 | |
| 0.42–0.92 | 0.61–0.81 | ||
| Postpartum Period | |||
| EPDS, Major depression | |||
| Ballard et al., 1994 (13-item version)70 | 13 | 0.96 | 0.70 |
| 0.78–1.0 | 0.51–0.85 | ||
| Harris et al., 198974 | 13 | 0.95 | 0.93 |
| 0.77–1.0 | 0.87–0.97 | ||
| 10 | 1.0 | 0.82 | |
| 0.85–1.0 | 0.73–0.89 | ||
| Beck and Gable, 200171 | 13 | 0.78 | 0.99 |
| 0.52–0.94 | 0.96–1.0 | ||
| Boyce et al., 199372 | 13 | 1.0 | 0.96 |
| 0.67–1.0 | 0.89–0.99 | ||
| 10 | 1.0 | 0.89 | |
| 0.66–1.0 | 0.81–0.95 | ||
| Cox et al., 199673 | 13 | 0.75 | 0.84 |
| 0.35–0.97 | 0.76–0.90 | ||
| 12 | 0.88 | 0.76 | |
| 0.47–1.0 | 0.67–0.83 | ||
| 10 | 0.88 | 0.71 | |
| 0.47–1.0 | 0.62–0.79 | ||
| EPDS, Major or minor depression | |||
| Cox et al., 199673 | 13 | 0.62 | 0.89 |
| 0.38–0.82 | 0.81–0.94 | ||
| Cox et al., 199673 | 12 | 0.76 | 0.81 |
| 0.53–0.92 | 0.73–0.88 | ||
| Cox et al., 199673 | 10 | 0.81 | 0.77 |
| 0.58–0.95 | 0.67–0.84 | ||
| Beck and Gable, 200171 | 10 | 0.59 | 0.86 |
| 0.43–0.73 | 0.78–0.92 | ||
| Leverton and Elliott, 200075 (Bedford Criteria) | 13 | 0.44 | 0.92 |
| 0.38–0.82 | 0.87–0.95 | ||
| Leverton and Elliott, 200075 | 10 | 0.69 | 0.85 |
| 0.41–0.89 | 0.79–0.90 | ||
| BDI, Major depression | |||
| Beck and Gable, 200171 (BDI-II) | 21 | 0.56 | 1.0 |
| 0.31–0.78 | 0.97–1.0 | ||
| Harris et al., 198974 (BDI) | 21 | 0.32 | 0.99 |
| 0.13–0.57 | 0.95–1.0 | ||
| 13 | 0.63 | 0.92 | |
| 0.38–0.84 | 0.85–0.96 | ||
| 11 | 0.68 | 0.88 | |
| 0.43–0.87 | 0.82–0.94 | ||
| BDI, Major or minor depression | |||
| Beck and Gable, 200171 (BDI-II) | 15 | 0.57 | 0.97 |
| 0.41–0.71 | 0.92–1.0 | ||
| Whiffen, 198877 (BDI) | 10 | 0.48 | 0.86 |
| 0.26–0.70 | 0.78–0.92 | ||
| PDSS, Major depression | |||
| Beck and Gable, 200171 | 81 | 0.94 | 0.98 |
| 0.73–1.0 | 0.94–1.0 | ||
| PDSS, Major or minor depression | |||
| Beck and Gable, 200171 | 61 | 0.91 | 0.72 |
| 0.79–0.98 | 0.62–0.80 | ||
| CES-D, Major or minor depression | |||
| Campbell and Cohn, 199132 | 16 | 0.60 | 0.92 |
| 0.50–0.70 | 0.90–0.93 | ||
| 21 | 0.43 | 0.97 | |
| 0.33–0.54 | 0.95–0.98 | ||
BDI, Beck Depression Inventory; CES-D, Center for Epidemiological Studies - Depression Scale; CI, confidence interval; EPDS, Edinburgh Postnatal Depression Scale; PDSS, Postpartum Depression Screening Scale.
For the Ballard et al. study employing the 13-item version (n = 23 depressed women),70 we used only the cutoff of ≥ 13. Mean sensitivity was 0.96 and mean specificity was 0.70, with relatively wide CIs for both point estimates.
Specificities ranged from 0.84 to 0.99 and appeared to be more precise than sensitivities, as indicated by the much narrower CIs. Of note, results at this threshold from these individual studies of the 10-item screen indicated that sensitivities were similar to the value reported in the one 13-item screen study, but specificities were higher with the 10-item version.
We attempted to conduct a meta-analysis of the sensitivity results from the four studies using the cutoff point of 13 or greater. Because the Boyce et al. study72 reported a sensitivity point estimate of 1.0, we were unable to generate a meaningful standard error; consequently, we could not include this result in the sensitivity meta-analysis. Leaving this study out, our meta-analysis produced a sensitivity point estimate of 0.91 (95% CI, 0.84 to 0.99); the test for heterogeneity was not significant (P = 0.141). We were able to include all four studies in our meta-analysis of specificity, but heterogeneity was significant (P < 0.001), precluding a pooled specificity estimate.
One study assessed a cutoff point of ≥ 12.73 It reported a sensitivity of 0.88 (with a wide CI) and a specificity of 0.76 (with a narrow CI).
Three studies reported a cutoff of ≥ 10, all producing estimates with imprecise sensitivities yet relatively precise specificities.72–74 Point estimates for sensitivity ranged from 0.88 in one study73 to 1.0 in the other two.72, 74 Because two studies reported a perfect sensitivity of 1.0, we could not determine a pooled sensitivity estimate. Specificity ranged from 0.71 to 0.89, but heterogeneity was significant (P = 0.002), precluding a pooled estimate.
Two studies reported a cutoff score of ≥ 13.73, 75 Sensitivities were low (0.62 in Cox et al.73 and 0.44 in Leverton and Elliott75) and imprecise (wide CIs). Specificities were high (0.89 and 0.92, respectively) and quite precise. A meta-analysis at this cutoff produced a pooled sensitivity estimate of 0.54 (95% CI, 0.39 to 0.70) without significant heterogeneity (P = 0.266) and a pooled specificity estimate of 0.91 (95% CI, 0.88 to 0.94) without significant heterogeneity (P = 0.410).
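To make the pooling step concrete, the Python sketch below applies a fixed-effect, inverse-variance approach with a Cochran Q test to two sensitivity estimates. This is an assumed method for illustration; this section does not specify the pooling model that was used. The case counts (13 of 21 and 7 of 16) are not reported here and are inferred from the stated sample sizes, prevalences, and sensitivities, and the function name is hypothetical.

```python
from math import erfc, sqrt

def pool_two_proportions(events, totals, z=1.96):
    """Fixed-effect inverse-variance pooling of exactly two proportions.

    Returns (pooled estimate, (ci_low, ci_high), Q statistic, heterogeneity P).
    A proportion of exactly 0 or 1 has zero estimated variance and cannot be
    weighted this way, so it raises ZeroDivisionError.
    """
    if len(events) != 2 or len(totals) != 2:
        raise ValueError("this sketch handles exactly two studies (df = 1)")
    props = [e / n for e, n in zip(events, totals)]
    # Weight = 1 / variance, with variance p(1 - p) / n.
    weights = [n / (p * (1 - p)) for p, n in zip(props, totals)]
    total_w = sum(weights)
    pooled = sum(w * p for w, p in zip(weights, props)) / total_w
    se = sqrt(1 / total_w)
    q = sum(w * (p - pooled) ** 2 for w, p in zip(weights, props))
    p_het = erfc(sqrt(q / 2))  # chi-square upper tail, valid only for df = 1
    return pooled, (pooled - z * se, pooled + z * se), q, p_het

# Assumed counts: 13 of 21 and 7 of 16 depressed women screened positive.
pooled, (low, high), q, p_het = pool_two_proportions([13, 7], [21, 16])
print(f"pooled sensitivity {pooled:.2f} ({low:.2f} to {high:.2f}), P = {p_het:.3f}")
```

Under these assumed counts the sketch lands close to the pooled sensitivity and heterogeneity P value given above, although small differences in rounding or in the true counts would shift the interval slightly.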
One study reported a cutoff score of 12 or greater.73 Relative to a threshold of 13 or more, this score appeared to improve sensitivity and decrease specificity, with the precision remaining unchanged.
Three studies reported results with a cutoff score of ≥ 10.71, 73, 75 Reported sensitivities ranged from 0.59 to 0.81, and specificities ranged from 0.77 to 0.88. Again, sensitivity estimates were quite imprecise, whereas specificity estimates were quite precise. A meta-analysis of these results produced a pooled sensitivity estimate of 0.68 (95% CI, 0.58 to 0.78) without significant heterogeneity (P = 0.140). Specificities could not be pooled because of significant heterogeneity (P = 0.068).
For major or minor depression, two articles reported BDI test characteristics using different thresholds.71, 77 Beck and Gable,71 using a cutoff of ≥ 15 on the BDI-II, reported a sensitivity of 0.57 and a specificity of 0.97. The BDI study by Whiffen employed a cutoff of ≥ 10 and reported a sensitivity of 0.48 and a specificity of 0.86.77
Postpartum Depression Screening Scale. One study of the PDSS (150 patients) reported high sensitivity (0.94) and high specificity (0.98) for major depressive disorder alone at a cutoff of ≥ 80.71 The investigators also reported lower sensitivity (0.91) and lower specificity (0.72) for major or minor depression using a cutoff of ≥ 60.
Center for Epidemiological Studies - Depression Scale. One study of the CES-D (1,007 patients) used two cutoff points (≥ 16 and ≥ 21).32 It reported low sensitivity (0.60 and 0.43, respectively) and high specificity (0.92 and 0.97, respectively) for major or minor depression.
The available evidence for both major depression alone and major or minor depression together is characterized by studies including markedly low numbers of depressed patients, a narrow racial and ethnic mix, varying cutoff points, and varying reference standards. These factors combine to preclude definitive conclusions or recommendations about screening instruments or thresholds.
Screening Instruments. For major depression alone, all screening instruments investigated (EPDS, BDI, PDSS) provided similarly high degrees of specificity at various cutoffs. Because of wide CIs, however, conclusions about sensitivity are more restricted. Heterogeneity among the studies limited our ability to synthesize these results quantitatively. In most instances, we could not obtain a more precise estimate. For an EPDS cutoff of ≥ 13 for patients with major depression alone, sensitivity estimates were combined in a meta-analysis to produce a point estimate of 0.91; however, heterogeneity precluded a meta-analysis for a specificity point estimate.
The EPDS and PDSS (with point estimates ranging from 0.75 to 1.0 at various cutoffs) appeared to be more sensitive than the BDI instruments (0.32 to 0.68 at various cutoffs), but the wide CIs overlapped nearly completely. A recent meta-analysis of prevalence estimates found that, compared with structured clinical interviews, the EPDS produced statistically equivalent prevalence estimates whereas the BDI produced significantly higher estimates.27 Together, these findings suggest that a positive screen with EPDS may be more clinically useful than screens with the other instruments.
For major or minor depression, sensitivity point estimates for each tool at each cutoff were consistently lower than those for major depression alone, although specificities were quite similar to those for major depression alone. We were able to synthesize EPDS results quantitatively at a cutoff of ≥ 13, producing a sensitivity point estimate of 0.54 with a wide CI (95% CI, 0.39 to 0.70) and a specificity estimate of 0.91 with a narrow CI (95% CI, 0.88 to 0.94). At an EPDS cutoff of ≥ 10, we were able to produce a pooled sensitivity estimate of 0.68 (95% CI, 0.58 to 0.78), but heterogeneity precluded a pooled analysis for specificity.
In short, estimates of specificity are relatively precise, but estimates of sensitivity are imprecise. This pattern of results prevents any substantive conclusions about the accuracy of these tools for identifying true positives. This imprecision can be attributed to the consistently low number of patients with a depression diagnosis, a fact reflected by a number of studies reporting 100 percent sensitivity, and it is a major limitation of the currently available data. Because of this imprecision, we cannot meaningfully compare sensitivities of screening instruments.
Cutoff Points. For an individual screening instrument, we cannot make any substantive conclusions about the use of a particular cutoff point. As noted above, the wide CIs for sensitivity prevent one from confidently distinguishing one sensitivity result from another. However, two further guides that bear directly on the choice of a threshold need to be considered before a particular threshold could be suggested.
First, the relative cost, or value, of errors in screening tests (false-negative compared to false-positive results) needs to be clarified. False-negative results (missing true depression) can lead to bad outcomes such as continued morbidity, costs of unnecessary tests, and similar effects. By contrast, false-positive results (identifying depression when it is not there) can lead to unnecessary time, effort, and financial cost for diagnostic workup as well as potential side effects of a treatment that is not indicated. If false-negative and false-positive results are equally bad, then a screening test should try to minimize both equally to identify the most effective cutoff.
If missing depression in a patient is worse than falsely identifying depression in a patient (i.e., a false-negative classification is worse than a false-positive one), then one would want a test that maximizes sensitivity and has the highest negative predictive value. Said another way, the preferred test would be one in which the greatest proportion of those screening negative do not have the disease. By contrast, if falsely identifying a patient as having depression (a false positive) is worse, then one would want a test that maximizes specificity and has the highest positive predictive value. Clinical intuition suggests that missing a diagnosis is worse than making an incorrect diagnosis. We could find no literature addressing the trade-off of false-positive versus false-negative diagnoses in this clinical situation.
A second important guide in choosing a cutoff is the prevalence of a disease in a particular population. Regardless of test characteristics, in populations in which the prevalence of depression is relatively high, the number of false-negative results is higher; in populations in which the prevalence is relatively low, the number of false-positive results is higher. Therefore, the choice of a test and cutoff may differ depending on whether the population has a higher prevalence of depression (e.g., a high-risk postpartum clinic) or a lower prevalence (e.g., a healthy baby clinic). As a result, three variables must be clarified before clinicians or researchers can choose a specific test and related cutoff: sensitivity and specificity, the relative cost of screening errors (false positives versus false negatives), and the prevalence of the disease.
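The dependence of predictive values on prevalence can be illustrated with a short Python sketch. The sensitivity (0.80), specificity (0.90), and the two prevalences (5 percent and 20 percent) are hypothetical values chosen only for illustration, and the function name is likewise an assumption.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a screen applied at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Hypothetical screen applied in a low- and a high-prevalence population.
for prevalence in (0.05, 0.20):
    ppv, npv = predictive_values(0.80, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With test characteristics held fixed, the positive predictive value rises and the negative predictive value falls as prevalence increases, which is the pattern described above.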
The above limitations notwithstanding, the tools we have reviewed above appear to be able to identify depressive illness in pregnant and postpartum women with a degree of accuracy similar to that for depression screen results in other nonpsychiatric settings. Screening results in primary care for a combined major or minor depression group are not available, but the results in primary care settings for major depression alone are similar to those reported for perinatal depression. For example, in a synthesis of depression case-identifying instruments in primary care settings using selection criteria similar to ours, Williams et al. reported a median sensitivity for major depression of 85 percent (range, 50 percent to 97 percent), and a median specificity of 74 percent (range, 51 percent to 98 percent).90 This review included both women and men, which might explain the lower measures of accuracy; female gender appears to improve the accuracy of depression screens in primary care settings.91
The small number of relevant articles limits our interpretation of the results. Given that most of the articles address the EPDS, we will use this instrument as an example. Because of the reports of 100 percent sensitivity in the prenatal tests of the EPDS (underscoring the very small number of prenatal depressed patients involved), we consider application of our results only to the postpartum population, and we draw on the prevalence data reported in Chapter 3 for KQ 1. We caution that, given the low numbers of depressed patients in the postpartum studies, the sensitivity estimates are likely to be inaccurate. Also, the majority of postpartum screens were performed 6 to 8 weeks after delivery, so the examples below apply only to that time period.
For major depression alone, the estimated point prevalence for the 6- to 8-week postpartum period is 6.8 percent, although the confidence interval around this estimate is wide. EPDS screens using the most commonly cited cutoff of 13 have a sensitivity of 91 percent and a specificity of approximately 95 percent. To illustrate this scenario, consider using this tool and cutoff for 1,000 patients. This EPDS screen would produce 62 true-positive cases and 6 false-negative cases, and 47 false-positive cases and 885 true-negative cases. The positive predictive value is 57 percent, meaning that the probability that a woman with a positive screen truly has major depressive disorder is slightly more than half. The negative predictive value (i.e., the probability that a woman with a negative screen would not have depressive illness) is 99 percent.
For major or minor depression, the estimated point prevalence from KQ 1 is 11.3 percent. EPDS screens tested for this population most commonly reported a cutoff of 10. This threshold at 6 to 8 weeks postpartum has a sensitivity of 68 percent and a specificity of approximately 80 percent. For 1,000 patients, the screen would produce 77 true-positive cases and 36 false-negative results, and 177 false-positive cases and 710 true-negative cases. The positive predictive value is 30 percent, and the negative predictive value is 95 percent.
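The arithmetic behind these two scenarios can be retraced with a brief Python sketch. It takes the prevalence, sensitivity, and specificity values quoted above and rounds expected counts to whole patients; the function name and the rounding order are assumptions made for illustration.

```python
def screen_counts(n, prevalence, sensitivity, specificity):
    """Expected 2x2 counts and predictive values for n screened patients."""
    cases = round(n * prevalence)
    true_pos = round(cases * sensitivity)
    false_neg = cases - true_pos
    non_cases = n - cases
    true_neg = round(non_cases * specificity)
    false_pos = non_cases - true_neg
    return {
        "TP": true_pos, "FN": false_neg, "FP": false_pos, "TN": true_neg,
        "PPV": true_pos / (true_pos + false_pos),
        "NPV": true_neg / (true_neg + false_neg),
    }

# Major depression alone: prevalence 6.8%, sensitivity 91%, specificity 95%.
major = screen_counts(1000, 0.068, 0.91, 0.95)
# Major or minor depression: prevalence 11.3%, sensitivity 68%, specificity 80%.
combined = screen_counts(1000, 0.113, 0.68, 0.80)

for label, result in (("major alone", major), ("major or minor", combined)):
    print(label, result["TP"], result["FN"], result["FP"], result["TN"],
          f"PPV {result['PPV']:.0%}", f"NPV {result['NPV']:.0%}")
```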
Very little is known about the accuracy of depression screening tests in pregnant and postpartum women. The available evidence is limited in several ways. It has a very narrow racial and ethnic mix. Study samples have prevalence rates of depression that are, by design, somewhat higher than our best estimate prevalence rates from KQ 1 (which would produce a higher positive predictive value). Most important, the available data involve small numbers of depressed patients. We could not address the limits of the small numbers of depressed patients using meta-analytic procedures. Case definitions, reference standards, screening tools, and screening thresholds all varied across the studies, and the heterogeneity of study methods constrained our ability to synthesize the data and obtain pooled estimates.
Despite these limitations, the available evidence does indicate that depression screens are feasible to administer in perinatal settings. It also suggests that the estimates of sensitivity and specificity, although limited, appear equivalent to those that have been reported in primary care settings. In particular, specificity is relatively good, suggesting a relatively good positive predictive value.
Further studies in this area need to standardize the parameters examined in this chapter (instruments and, in particular, cutoff points), involve a more representative mix of racial and ethnic groups, test the screening tools in populations with a frequency of depression more reflective of the actual prevalence, and include a larger number of depressed patients to clarify the accuracy of depression screening tools and make them more relevant to the population of interest. Given the currently available evidence, we offer six future research recommendations.
First, subsequent studies on the test characteristics of screeners must be designed with sample size estimates that take into account prevalence and that project a reasonable width of sensitivity confidence intervals for the particular illness. For example, studies would need to screen 1,000 women to identify 34 with major depression or 110 with major or minor depression. This sample size might be enough for precise estimates for women with major or minor depression as a group, but it may not be enough for precise estimates for major depression alone.
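A rough planning calculation shows why the number of depressed women, rather than the number screened, drives precision. The Python sketch below uses a simple normal (Wald) approximation for the sensitivity confidence interval; the anticipated sensitivity of 0.90 and the target half-width of 0.05 are hypothetical, the 3.4 percent prevalence is inferred from the 34-per-1,000 figure above, and the function names are assumptions.

```python
from math import ceil, sqrt

Z95 = 1.96  # normal quantile for a two-sided 95% interval

def half_width(sensitivity, cases, z=Z95):
    """Approximate CI half-width for sensitivity given the number of cases."""
    return z * sqrt(sensitivity * (1 - sensitivity) / cases)

def cases_needed(sensitivity, target_half_width, z=Z95):
    """Depressed women needed to reach a target CI half-width."""
    return ceil(z**2 * sensitivity * (1 - sensitivity) / target_half_width**2)

def women_to_screen(cases, prevalence):
    """Women to screen so that the expected number of cases is reached."""
    return ceil(cases / prevalence)

# Precision from screening 1,000 women (34 or 110 expected cases).
for cases in (34, 110):
    print(f"{cases} cases: half-width about {half_width(0.90, cases):.2f}")

# Screening volume for a hypothetical half-width of 0.05 at 3.4% prevalence.
needed = cases_needed(0.90, 0.05)
print(needed, "cases, so about", women_to_screen(needed, 0.034), "women screened")
```

The Wald approximation is crude near sensitivities of 0 or 1, so an exact method would be preferable for a real protocol; the sketch is meant only to show the order of magnitude involved.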
Second, the sample should represent the target population. Specifically, subsequent studies need to provide a more representative racial and ethnic mix. In addition, studies should incorporate a range of other demographic variables that could influence screening performance, such as socioeconomic status measures, and assess the screening tools in these subpopulations.
Third, as in the Beck and Gable study,71 subsequent studies should assess and directly compare multiple screening instruments. This design provides a head-to-head comparison that allows researchers and clinicians to understand which screening instruments are more accurate than others in different settings.
Fourth, studies evaluating both the risks and benefits of screening, specifically assessing the relative cost of false-negative and false-positive results, will provide insights on how to consider target sensitivity and specificity when attempting to maximize cost-effectiveness.
Fifth, subsequent depression screening studies should carefully consider whether to target major depression alone, for which beneficial treatments clearly exist, or the traditional combined category of major or minor depression, a heterogeneous group for which treatment benefit is unclear. Our results suggest that the sensitivity of screening instruments is generally greater for the major depression alone group.
Sixth, the bulk of the screening studies we reviewed were conducted in the first 3 months postpartum. Subsequent studies should examine screening not just in the first 3 months postpartum but also at 6 weeks, 6 months, and 12 months postpartum. If peak prevalence and incidence occur within the first 6 weeks, the obstetrics clinic is a prime place to target resources for such a program. If, however, peaks occur after this time, most postpartum women will have completed follow-up care with an obstetrician, so programs in an obstetrics clinic may be less helpful. In this case, it is possible that programs targeting new mothers in family medicine, internal medicine, or pediatric clinics might be more effective.
In the interim, what is a clinician to do? The best available evidence supports the conclusion that screening instruments with reasonable test characteristics appear feasible to use in a perinatal population with a depression prevalence between 5 percent and 10 percent. Given that use of the tools likely carries low risk, and that they all have reasonable specificity (and, thus, a reasonable positive predictive value), the selection of a tool would be guided by an interest in maximizing sensitivity. For the category of major or minor depression, sensitivity estimates were quite similar for all instruments. However, for major depression alone, sensitivity estimates for the EPDS and PDSS appear to be higher than those for the BDI. The standard cutoffs of ≥ 13 for the EPDS and ≥ 81 for the PDSS appear to be reasonable thresholds.
Having an instrument that can accurately identify women at risk of having perinatal depression is an important and necessary link in improving the clinical outcomes of women with perinatal depression: women who may benefit from a depression intervention first need to be recognized. Nonetheless, it remains merely an initial step. A more important question is whether screening pregnant or postpartum women to identify those at risk of having depression, and subsequently providing an intervention, ultimately leads to improved outcome. We address this key question in our next chapter.
In agreement with the Safe Motherhood Group and the Agency for Healthcare Research and Quality (AHRQ), we directed part of our work to Key Question (KQ) 3: Does prenatal or early postnatal screening for depressive symptoms with subsequent intervention lead to improved outcomes? That is, does screening for depression during pregnancy or the postpartum period and implementing an intervention improve outcomes related to maternal depressive symptoms? To address KQ 3, we developed an analytic framework (Figure 19
As described in this chapter, screening can be done in various settings and with various instruments (as discussed for KQ 2). Interventions are both nonpharmacologic (e.g., counseling and behavioral intervention programs aimed at mothers or, in some cases, both parents or mother-infant dyads) and pharmacologic (e.g., antidepressants). These interventions can be implemented in various outpatient settings (e.g., clinics, homes) and delivered by various types of health professionals, and they may be group efforts or one-on-one activities.
Chapter 2 documents the methods we used to conduct literature searches and title and abstract or full article reviews. We did not identify any studies that specifically examined the cascade of screening-treatment-outcomes. Thus, we do not have any direct evidence pertaining to KQ 3.
All the trials included for KQ 3 are treatment studies that had a screening component (either a formal depression screening instrument or other type of screen that identified women at risk of a depressive illness). We included studies conducted worldwide in developed countries where the population could be generalized to pregnant and postpartum women in the United States, regardless of the language spoken. We also included both randomized controlled trials (RCTs) and prospective cohort studies. Additionally, for inclusion in KQ 3, patients were identified by a screen done either during pregnancy or during the 12 months postpartum and considered to be “at risk” of having a depressive illness.
We excluded all case-control studies and studies in which patients had had a documented current depressive episode before the initial screening. Furthermore, we excluded two studies that had originally been reviewed for the feasibility study, one because it did not use any screening92 and one because it had no depression severity outcome.93
We attempted to synthesize the results of the included studies quantitatively, but the study methods (screening instruments, type of intervention, intensity of intervention, outcomes measured) were so heterogeneous that a combined result would have little meaning. We also attempted to compare effect sizes in an exploratory analysis of the various studies, but the data necessary to compute these were not available.
Appendix B presents the quality rating form used for articles considered for KQ 3. The total possible score for these studies was 29. We characterized studies with scores of 20 or greater as good, those with scores between 15 and 19 as fair, and those with scores of 14 and below as poor. The domains and maximum points possible for each domain are as follows:
Reporting (domain score of 11): 10 items covering study aims, measures, patient populations, findings, and statistical presentation; each scored yes or no (1 or 0), except for an item concerning principal confounders that was scored yes, partially, or no (2, 1, or 0, respectively).
External validity (domain score of 3): Three items relating to representativeness of populations from which people were recruited and of settings and clinicians that treat such patients; each scored yes or no (1 or 0).
Internal validity-bias (domain score of 7): Seven items relating to issues such as blinding subjects and outcomes assessors, follow-up periods, appropriate statistical tests, and use of reliable and valid outcome measures; each scored yes, no, or unable to determine (1, 0, or 0, respectively).
Internal validity-confounding (domain score of 6): Six items relating to sources of intervention and control groups, randomization of study subjects and concealment of allocation, adequacy of adjustments for confounding, and loss to follow-up; each scored yes, no, or unable to determine (1, 0, or 0, respectively).
Power (domain score of 2): One item about use of power analysis to determine sample size; scored no, yes for one measure, or yes for two or more measures (0, 1, and 2, respectively).
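The scoring scheme described above can be sketched as a small function. This is an illustration under stated assumptions rather than the rating form itself: the function name, the dictionary keys, and the example domain scores are hypothetical, while the domain maxima and the good, fair, and poor thresholds come from the text.

```python
# Maximum points per domain, as listed in the text (total of 29).
DOMAIN_MAXIMA = {
    "reporting": 11,
    "external_validity": 3,
    "internal_validity_bias": 7,
    "internal_validity_confounding": 6,
    "power": 2,
}


def grade_study(domain_scores):
    """Sum the domain scores and map the total to good, fair, or poor."""
    for domain, score in domain_scores.items():
        if not 0 <= score <= DOMAIN_MAXIMA[domain]:
            raise ValueError(f"{domain} score {score} is out of range")
    total = sum(domain_scores.values())
    if total >= 20:
        grade = "good"
    elif total >= 15:
        grade = "fair"
    else:
        grade = "poor"
    return total, grade


# Illustrative domain scores (hypothetical input for the sketch).
example_scores = {
    "reporting": 9,
    "external_validity": 2,
    "internal_validity_bias": 5,
    "internal_validity_confounding": 6,
    "power": 0,
}
example_total, example_grade = grade_study(example_scores)
print(example_total, example_grade)
```

The range check guards against a domain score that exceeds its stated maximum, so the total can never exceed 29.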
| Author, Year | Country | Study Design | Sample Size | Setting | Type of Screening | Type of Intervention |
|---|---|---|---|---|---|---|
| Screening during Pregnancy | ||||||
| Brugha et al., 200095 | UK | RCT | 209 | Prenatal clinic | Modified GHQ-D | Structured group prenatal preparation classes |
| Elliott et al., 200096 | UK | Nonrandomized controlled trial | 98 | Prenatal clinic | Leverton Questionnaire; Crown Crisp | Structured group prenatal preparation classes |
| Stamp et al., 199594 | Australia | RCT | 129 | Prenatal clinic | Modified prenatal questionnaire | Perinatal support group |
| Zlotnick et al., 200197 | US | RCT | 37 | Prenatal clinic | Screening survey | Four prenatal therapy/skills groups |
| Screening during Postpartum Period | ||||||
| Armstrong et al., 199998 | Australia | RCT | 181 | PP hospital ward | Adverse family risk factors from Brisbane Evaluation of Needs Questionnaire | Regular home visits by child-health nurses |
| Chabrol et al., 200299 | France | RCT | 859 screened; 258 randomized | PP hospital ward | EPDS | One CBT prevention group during the delivery hospital stay followed by an at-home CBT-based program in women with major depression |
| Chen et al., 2000100 | Taiwan | RCT | 414 screened; 115 randomized | PP hospital ward | Taiwanese BDI; measures of support | Four weekly PP support group sessions |
| Dennis, 2003101 | Canada | RCT | 501 screened; 44 randomized | Child immunization clinics | EPDS | Telephone-based peer support |
| Fleming et al., 1992102 | Canada | Nonrandomized controlled trial | 781 screened; 152 enrolled | PP hospital ward | EPDS; CES; MAACL | PP social support group |
| Hiscock and Wake, 2002103 | Australia | RCT | 155 screened; 99 randomized | Child-health center | EPDS | Infant sleep intervention group |
| Honey et al., 2002104 | UK | RCT | 45 randomized | Mother/baby clinic | EPDS | Psycho-educational group |
| Horowitz et al., 2001105 | US | RCT | 1,215 screened; 122 randomized | Community sample of PP women | EPDS | Coached behavioral intervention to promote maternal-baby interaction |
| Onozawa et al., 2001106 | UK | RCT | 59 | PP hospital ward | EPDS | Infant massage plus support group |
| Wisner and Wheeler, 1994107 | US | Open trial | 23 | PP hospital ward | Prior history of PPD | Antidepressant medication |
| Wisner et al., 2001108 | US | RCT | 581 screened; 56 randomized | PP hospital ward | Prior history of PPD | Antidepressant medication |
BDI, Beck Depression Inventory; CBT, cognitive behavioral therapy; CES, Current Experience Scale; EPDS, Edinburgh Postnatal Depression Scale; GHQ-D, General Health Questionnaire Depression Score; MAACL, Multiple Affect Adjective Checklist; PP, postpartum; PPD, postpartum depression; RCT, randomized controlled trial.
| Author, Year | Reporting (11) | External Validity (3) | Internal Validity-Bias (7) | Internal Validity-Confounding (6) | Power (2) | Total Score (29) |
|---|---|---|---|---|---|---|
| Screening during Pregnancy | ||||||
| Brugha et al., 200095 | 7 | 0 | 3 | 6 | 1 | 17 |
| Elliott et al., 200096 | 5 | 0 | 2 | 4 | 0 | 11 |
| Stamp et al., 199594 | 6 | 0 | 1 | 5 | 1 | 13 |
| Zlotnick et al., 200197 | 5 | 0 | 4 | 3 | 0 | 12 |
| Screening during Postpartum Period | ||||||
| Armstrong et al., 199998 | 9 | 1 | 5 | 3 | 1 | 19 |
| Brisco et al., 198993 | 6 | 1 | 4 | 3 | 0 | 14 |
| Chabrol et al., 200299 | 8 | 0 | 2 | 5 | 0 | 15 |
| Chen et al., 2000100 | 8 | 0 | 3 | 3 | 0 | 14 |
| Dennis, 2003101 | 9 | 2 | 5 | 6 | 0 | 22 |
| Fleming et al., 1992102 | 6 | 0 | 3 | 2 | 0 | 11 |
| Hiscock and Wake, 2002103 | 7 | 1 | 5 | 5 | 1 | 19 |
| Honey et al., 2002104 | 9 | 0 | 3 | 4 | 0 | 16 |
| Horowitz et al., 2001105 | 7 | 0 | 4 | 3 | 1 | 15 |
| Onozawa et al., 2001106 | 10 | 0 | 3 | 3 | 0 | 16 |
| Wisner and Wheeler, 1994107 | 8 | 0 | 3 | 1 | 0 | 12 |
| Wisner et al., 2001108 | 7 | 0 | 6 | 4 | 1 | 18 |
Note: Maximum possible score in parentheses.
| Author, Year | Type of Intervention | Outcome Measures | Significant Differences between Intervention and Control Group |
|---|---|---|---|
| Screening during Pregnancy | |||
| Brugha et al., 200095 | Structured group prenatal preparation classes | GHQ-D; EPDS; SCAN | No significant differences on any measure |
| Elliott et al., 200096 | Structured group prenatal preparation classes | EPDS; PSE | Intervention group had significantly lower EPDS scores in first-time mothers; no significant difference on PSE for diagnosis of major depression |
| Stamp et al., 199594 | Perinatal support group | EPDS | No significant differences on this measure |
| Zlotnick et al., 200197 | Four prenatal therapy/skills groups | BDI; SCID | Intervention group had a significantly greater change over time; at follow-up, intervention group had a significantly lower level of maternal depression |
| Screening during Postpartum Period | |||
| Armstrong et al., 199998 | Regular home visits by child-health nurses | EPDS; PSI; child health; HOME | For secondary outcomes, intervention group had significantly lower depression scores and a positive effect on parent-infant interaction |
| Chabrol et al., 200299 | One CBT-based prevention group during the PP hospitalization, followed by an at-home CBT-based program in women with major depression | EPDS; HAM-D; BDI | Intervention group had significant reductions in frequency of depressive symptoms |
| Chen et al., 2000100 | Four weekly PP support group sessions | BDI; PSS; ISEL | Intervention group had significantly lower rates of depression and rates of perceived stress and more interpersonal support |
| Dennis, 2003101 | Telephone-based peer support | EPDS | Intervention group had significantly lower EPDS scores |
| Fleming et al., 1992102 | PP social support group | EPDS; CES | No significant differences on any measure |
| Hiscock and Wake, 2002103 | Infant sleep intervention (controlled crying) group | EPDS; maternal and infant sleep quality; maternal stress | Intervention group members with higher depression scores at baseline had significantly greater improvement in EPDS scores and reported improvements in sleep quality |
| Honey et al., 2002104 | Psycho-educational group | EPDS | Intervention group had significant reductions in depressive symptoms |
| Horowitz et al., 2001105 | Coached behavioral intervention to promote maternal-baby responsiveness | BDI-II; DMC | No significant differences for maternal depression; intervention group showed significantly better mother-infant responsiveness |
| Onozawa et al., 2001106 | Infant massage classes plus support group | EPDS; videotape of mother-infant interaction | No significant differences for maternal depression; intervention group showed significant improvements in mother-infant interaction |
| Wisner and Wheeler, 1994107 | Antidepressant (nortriptyline) | Clinical interview; IDD | Intervention group had significantly lower proportion of new episodes of major depression |
| Wisner et al., 2001108 | Antidepressant (nortriptyline) | RDC; HAM-D | No significant differences in the rate of recurrence |
BDI, Beck Depression Inventory; CBT, cognitive behavioral therapy; CES, Current Experience Scale; DMC, Dyadic Mutuality Code; EPDS, Edinburgh Postnatal Depression Scale; GHQ-D, General Health Questionnaire-Depression Subscale; HAM-D, Hamilton Depression Rating Scale; HOME, Home Observation for Measurement of the Environment; IDD, Inventory to Diagnose Depression; ISEL, Interpersonal Support Evaluation List; PSI, Parenting Stress Index; PSE, Present State Examination; PSS, Perceived Stress Scale; RDC, Research Diagnostic Criteria; SCAN, Schedule for Clinical Assessment in Neuropsychiatry; SCID, Structured Clinical Interview for Diagnosis.
The types and frequency of screening measures and the types of interventions applied varied appreciably among the studies we reviewed. Of the 15 studies retained for the full study, 4 examined intervention efforts for which screening had been done in the prenatal period, and 11 examined screening and interventions in the postpartum period. The remainder of this section reports on the studies in these two main categories.
Of the four studies examining screening, interventions, and outcomes in the prenatal period,94–97 three were RCTs94, 95, 97 and one was a nonrandomized controlled trial.96 All four studies (published between 1995 and 2001) were set in prenatal clinics. Sample sizes for screening ranged from 37 to 209, for a total population of 473 women. The types of screening instruments used to identify patients with depressive symptoms differed among these studies; similarly, the outcome measures differed, although three studies used the Edinburgh Postnatal Depression Scale (EPDS) as one measure. All four studies implemented some type of psychological intervention, generally characterized as group classes or sessions relating to prenatal preparation, skills, and perinatal support. One study was considered fair; the other three were poor.
Brugha et al. screened 209 women with a modified General Health Questionnaire Depression Score (GHQ-D) to study the effect of six weekly prenatal group therapy classes called “Preparing for Parenthood” compared to routine prenatal care.95 In this study, which we graded as fair, the program aimed to increase social support and problem-solving skills. Outcome measures that assessed maternal mood and depressive symptoms at 3 months postpartum included the GHQ-D (cutoff ≥ 2), the EPDS (cutoff score of > 11), and the Schedule for Clinical Assessment in Neuropsychiatry (SCAN, related to the International Classification of Diseases [ICD], version 10). Assignment to the intervention group did not significantly improve postpartum depression. On the GHQ-D, 26 percent of the intervention group and 22 percent of the control group scored at or above 2, with an adjusted odds ratio of 1.19 (95% confidence interval [CI], 0.59 to 2.37). On the EPDS, 16 percent of the intervention group and 19 percent of the control group scored above 11, with an adjusted odds ratio of 0.83 (95% CI, 0.39 to 1.79).
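The odds ratios reported for Brugha et al. can be read with a short calculation. The sketch below is an illustration, not the authors' analysis: it computes crude (unadjusted) odds ratios from the rounded percentages in the text, so they will not match the adjusted values exactly, and it checks whether each reported 95% CI spans the null value of 1. The function names are hypothetical.

```python
def odds_ratio_from_proportions(p_intervention, p_control):
    """Crude odds ratio: odds of the outcome in intervention vs. control."""
    odds_intervention = p_intervention / (1.0 - p_intervention)
    odds_control = p_control / (1.0 - p_control)
    return odds_intervention / odds_control


def interval_excludes_null(lower, upper, null_value=1.0):
    """True when the confidence interval does not contain the null value."""
    return not (lower <= null_value <= upper)


# Rounded percentages reported in the text (GHQ-D >= 2 and EPDS > 11).
ghq_crude_or = odds_ratio_from_proportions(0.26, 0.22)
epds_crude_or = odds_ratio_from_proportions(0.16, 0.19)

# Reported 95% CIs for the adjusted odds ratios.
ghq_significant = interval_excludes_null(0.59, 2.37)
epds_significant = interval_excludes_null(0.39, 1.79)

print(f"GHQ-D crude OR {ghq_crude_or:.2f}, CI excludes 1: {ghq_significant}")
print(f"EPDS crude OR {epds_crude_or:.2f}, CI excludes 1: {epds_significant}")
```

Both reported intervals contain 1, which is consistent with the statement that the intervention did not significantly change postpartum depression on either measure.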
In the earliest study we included in this group (1995), Stamp et al. used a study-specific, modified prenatal questionnaire as a screening instrument and assigned 129 patients to either two prenatal group classes plus one postpartum class (at 6 weeks) or routine care.94 Outcome measures were the EPDS at 6 weeks and 6 months postpartum, using a cutoff point of > 12 for major depression and > 9 for major or minor depression. The intervention did not significantly reduce rates of postpartum depression on either measure. For example, at 6 weeks, 13 percent of the intervention group and 17 percent of the control group had EPDS scores greater than 12; at 6 months, the figures were 15 percent and 10 percent, respectively.
Elliott et al. screened 98 women with the Leverton Questionnaire and the depression, anxiety, and somatic subscales of the Crown Crisp Experiential Index.96 The authors studied a preventive group psychosocial intervention versus routine care; they also looked at differences between first- and second-time mothers. The structured group intervention was conducted once per month for 5 months during the prenatal period (starting at 24 weeks) and for 6 months postpartum. Outcome measures included the EPDS and the Present State Examination (PSE), as well as a self-rating questionnaire, at 3 and 12 months postpartum. For first-time mothers, the median EPDS score was significantly lower in the intervention group (Mann-Whitney one-tailed test, P = 0.005); for second-time mothers, the median EPDS did not differ significantly between the two groups. The PSE served as a formal diagnosis of depression, and the investigators reported no significant differences in diagnosis of major depression. When the authors included cases of borderline depression or “minor depression” in the analysis, first-time mothers in the intervention group were significantly less likely to have a diagnosis of depression than controls (19 percent and 39 percent, respectively, Chi-square = 2.64, one-tailed test, P < 0.05). PSE scores did not differ significantly between groups in second-time mothers.
In the only study in this category done in the United States, Zlotnick et al. used the Beck Depression Inventory (BDI) and a Diagnostic and Statistical Manual (DSM-IV) Structured Clinical Interview for Depression (SCID) as a positive screen among women of low socioeconomic status (SES).97 They excluded patients who met criteria for a current episode of major depression based on the SCID. A total of 37 patients with a positive screen were assigned to either a four-session Interpersonal-Therapy-Oriented Group (given weekly) or to a usual-care group. Outcome measures included the BDI before and after the intervention and the SCID at 3 months postpartum. Women in the intervention group had a significantly greater change in their BDI scores from baseline than did those in the control group (“pre” versus “post” intervention Beck scores were 13.0 and 8.4, respectively, for the treatment group). In contrast, the control group “pre” versus “post” intervention scores were 9.2 and 11.3, respectively, suggesting that they got worse over time. The change between the intervention and the control group was significant (t-test = 3.50; df = 33; P = 0.001). In addition, women in the intervention group had a significantly decreased rate of major depression during the postpartum period as measured by the SCID at 3 months postpartum; no women in the intervention group and 33 percent of women in the usual-care group developed postpartum depression (P > 0.02).
These four small studies of programs for women identified by screening prenatally did not, collectively, produce many positive results from the various psychosocial interventions as compared with usual care. All of these studies scored poor on external validity (0 of 3 points), and two of the four had 0 of 2 points for power. The four studies did, at best, only a fair job of reporting data (from 5 to 7 of 11 points). For bias, the study scores ranged from 1 to 4 of 7 possible points; for confounding, they ranged from 3 to 6 of 6 points. Given the heterogeneity in populations, the screening instruments and cutoff points for defining “at-risk” individuals, the interventions themselves, and the outcome measures used, we cannot draw any overall conclusions about the utility of such programs.
Of the 11 studies examining screening and intervention outcomes only in the postpartum period,98–108 eight were RCTs published between 1992 and 2003,98, 100, 101, 103–106, 108 and three were controlled trials published between 1992 and 2002.99, 102, 107 Sample sizes ranged from 23 to 1,215, for a total population of 4,289 women.
As with the prenatal screening studies, the screening instruments used to identify patients with depressive symptoms differed among the postnatal studies, although the EPDS was used in the majority of studies and two studies by the same investigator team used “prior history of postpartum depression.” The treatment interventions also differed considerably. Nine of these studies involved various behavioral and psycho-educational programs or other innovative activities (e.g., infant massage or infant sleep interventions); two involved tests of antidepressants. Unlike the prenatal studies, the settings varied from postpartum hospital wards to child-health and immunization clinics. Finally, the outcome measures also varied across these studies, but the EPDS was most commonly used (in seven studies). We graded one study good, seven fair, and three poor.
Behavioral and Psychosocial Interventions. Of the nine studies in this subgroup, one was conducted in the United States; the remainder were in Australia (two studies), Canada (two studies), the United Kingdom (not otherwise specified, two studies), and France and Taiwan (one study each). Using “number randomized or enrolled” as the metric, the sample sizes ranged from 45 to 859. We describe the studies below according to quality grade and sample size.
In a recent study rated good that randomized participants for the intervention (not screening), Dennis et al. screened 501 women recruited from child immunization clinics between 8 and 12 weeks postpartum.101 Inclusion criteria included the mother's being at least 18 years of age, having a singleton birth, and delivering a full-term infant. Women were screened using the EPDS (cutoff score > 9). The 44 women with a positive screen were randomized to a “mother-to-mother” peer support telephone intervention or to routine care. The outcome measures were the EPDS at 4 and 8 weeks after randomization. The women in the intervention group had significantly lower EPDS scores than those in the control group: 15 percent of the intervention group and 52.4 percent of the control group had an EPDS > 12 at 8 weeks (P = 0.02).
In the largest of the seven studies rated fair, Chabrol et al. screened 859 women and identified 258 who were at risk based on an EPDS > 9 on day 2 or 3 postpartum.99 They assigned these 258 women randomly to receive a cognitive behavioral therapy (CBT) intervention (n = 130) or routine care (n = 128) during the postpartum hospitalization. CBT is a form of psychotherapy that actively examines how cognitions influence emotions or affect and involves active exploration, clarification, and testing of the patient's perceptions and beliefs.109 Outcome measures for the prevention intervention included the EPDS (cutoff ≥ 11) taken at 4 to 6 weeks postpartum.
Women in the CBT group who continued to have positive screens on the EPDS (defined as EPDS score ≥ 11) at 4 to 6 weeks were assessed for major depression in a clinical interview using the Mini-Neuropsychiatric Interview (MINI) and DSM-IV criteria. Those with major depression were offered an at-home CBT program for five to eight additional sessions. These women were then compared with women with probable major depression in the control group at 10 to 12 weeks using the EPDS, Hamilton Depression Rating Scale (HAM-D), and the BDI. Women in the control group received one initial home visit assessment but then received only weekly telephone checks.
The study results demonstrated that women in the prevention intervention group had significant reductions in the frequency of depressive symptoms. At 4 to 6 weeks postpartum, 30.2 percent of those in the CBT group versus 48.2 percent in the control group (P = 0.0067) were still depressed (based on an EPDS score of ≥ 11). Additionally, the intensity of depressive symptoms measured by the mean score on the EPDS was significantly lower in the prevention group than in the control group: mean EPDS scores, respectively, of 8.5 (standard deviation [SD] 4) and 10.3 (SD 4.4) (t-test = 3.06, df = 209, P = 0.0024); the analyses indicate a medium effect size (ES, 0.42). At 10 to 12 weeks postpartum after completion of the home-based CBT intervention, women in the intervention group had significantly lower scores on all measures of depressive symptoms (HAM-D, BDI, EPDS) than did those in the control group. Specifically, the intervention and the control group mean scores were as follows: HAM-D, 5.7 versus 16.2 (t-test = 8.4, P < 0.0001); BDI, 4.7 versus 15.7 (t-test = 9, P < 0.0001); and EPDS, 5.9 versus 13.7 (t-test = 7.7; P < 0.0001).
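The effect size reported for the 4- to 6-week EPDS means can be approximated from the summary statistics in the text. The sketch below is an illustration under assumptions, not the authors' computation: it uses Cohen's d with an equally weighted pooled standard deviation because the group sizes at follow-up are not given here, and the function name is hypothetical.

```python
import math


def cohens_d(mean_control, sd_control, mean_intervention, sd_intervention):
    """Standardized mean difference using an equally weighted pooled SD.

    Assumes roughly equal group sizes; with unequal groups the pooled SD
    would be weighted by each group's degrees of freedom instead.
    """
    pooled_sd = math.sqrt((sd_control ** 2 + sd_intervention ** 2) / 2.0)
    return (mean_control - mean_intervention) / pooled_sd


# Mean EPDS (SD) at 4 to 6 weeks, from the text: control 10.3 (4.4), CBT 8.5 (4).
effect_size = cohens_d(10.3, 4.4, 8.5, 4.0)
print(f"Cohen's d = {effect_size:.2f}")
```

Under these assumptions the estimate comes to about 0.43, close to the reported effect size of 0.42; the small gap may reflect rounding of the published means and SDs or the actual group sizes.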
Armstrong et al. screened 181 women with good literacy skills in the immediate postpartum period by asking about a history of trauma or abuse or a positive screen for adverse family characteristics on the Brisbane Evaluation of Needs Questionnaire.98 Women were randomized to receive 6 months of home visits by a child-health nurse or routine primary care. Primary outcome measures involved measures of child health, parental and family functioning (measured by the EPDS [cutoff > 12] and the Parenting Stress Index [PSI]), quality of the home environment (HOME assessment), and satisfaction with community services. All assessments were administered immediately postpartum.
If we focus primarily on scores of maternal depression and functioning at 6 weeks postpartum, women in the intervention group had significantly lower depression scores than the control group: 5.8 percent in the intervention group and 20.7 percent in the control group (P = 0.003) with EPDS > 12. Additionally, women in the intervention group had significantly lower (better) PSI scores at 6 weeks than controls (15.3 versus 38.4, P < 0.001). The investigators also reported that the total HOME score differed significantly between groups: 28.34 for the intervention group versus 25.51 for the control group (P < 0.001), providing evidence for the positive effect the intervention had on influencing parent-infant interaction and the home environment for the child.
Hiscock and Wake recruited 155 women from a child-health center at 7 to 8 months postpartum and screened with the EPDS (cutoff > 12).103 Other inclusion criteria included reported child sleep problems. Of these women, 99 were considered depressed (baseline EPDS ≥ 10) and randomly assigned to either an infant sleep intervention group or a usual-care group. The infant sleep intervention comprised three private sessions (one session every 2 weeks) held at the local child-health center where sleep management plans were discussed, including an emphasis on controlled crying (where parents responded to their infants' crying at increasing time intervals, allowing the infant to fall asleep unaided). At the 10- to 12-month follow-up assessment, outcome measures included the EPDS, measures of sleep quality, and measures of maternal stress and coping.
The results of this study were mixed. Women who began with higher (worse) scores of depression at baseline had a significantly greater improvement in their EPDS scores than did those in the control group. At the 10-month follow-up, women in the intervention group had a 6.0 point decrease (95% CI, 7.5 to 4.0) in EPDS score compared to a 3.7 point decrease (95% CI, 4.9 to 2.6) in the control group (P = 0.01). At the 12-month follow-up visit, the intervention group had a 6.5 point decrease (95% CI, 7.9 to 5.1) in EPDS score compared to a 4.2 point decrease (95% CI, 5.9 to 2.5) in the control group (P = 0.04). Also, at the 10-month follow-up, women in the intervention group reported improvements in their own sleep quality, including being more likely than control mothers to rate their own sleep quality as “very good” and less likely to rate it as “very bad” (Chi square = 9.93; P = 0.02). They were also more likely to report having “enough sleep” and less likely to report “not enough” sleep (Chi square = 8.11, P = 0.04).
Horowitz et al. screened 1,215 women at 2 to 4 weeks postpartum with the EPDS (cutoff > 10). Women with positive screens (n = 122) were randomly assigned to either an interactive coaching intervention or a control group. The coached behavioral intervention was designed to promote maternal-infant responsiveness. All women in the study (both intervention and control groups) received three home visits when their infants were 4 to 8 weeks, 10 to 14 weeks, and 14 to 18 weeks of age; the women in the intervention group practiced the coaching intervention during these visits. Outcome measures included the BDI-II for maternal depression and, secondarily, the Dyadic Mutuality Code (DMC), a measure of the level of responsiveness in the maternal-infant relationship. Responsiveness was defined as “the mother's ability to accommodate to her infant's behavior and to give it meaning through regulation of her own behavioral responses” (p. 326). The intervention and control groups did not differ significantly in terms of maternal depression scores (BDI-II) at any time period. The DMC showed a significantly better outcome for mother-infant responsiveness for the treatment group (P = 0.06).
Onozawa et al. screened 581 primiparous women with the EPDS (cutoff ≥ 13) at 4 weeks postpartum.106 Of 91 women who had a positive screen, 59 agreed to participate in the study. Participants were randomized to either a 5-week infant massage class with a support group or the support group only. The 1-hour infant massage class (approved by the International Association of Infant Massage) taught parents the techniques of infant massage by encouraging parents to observe and respond to their infants' body language and cues and to adjust their touch accordingly. Outcome measures included maternal depression on the EPDS (cutoff ≥ 13) at 4 weeks and 2 months postpartum and a videotaped mother-infant interaction that assessed the mother's attitude toward the infant, the infant's response to the mother, and the overall quality of the interaction. At 14 weeks postpartum, EPDS scores had fallen for both groups (reported as a change in median EPDS score from baseline to final visit), but the intervention group demonstrated a significantly greater change in scores than did the control group (intervention group baseline of 15.0 and final visit score of 5.0, versus the control group score of 16.0 at baseline to 10.0 at the final visit; P = 0.03). Additionally, significant improvements in all aspects of mother-infant interaction as measured by the videotape were seen only in the massage group (P = 0.0004).
Honey et al. used the EPDS (cutoff > 12) to screen postpartum women recruited through mother-baby clinics but assessed at home.104 The 45 women with a positive screen on the EPDS were randomly assigned to either an 8-week psycho-educational group (PEG) or to a routine care group. Outcome measures included the EPDS (cutoff > 12) after completion of the PEG and at a 6-month follow-up. At the end of the 8-week assessment interval, the women in the PEG did not differ significantly from those in the routine-care group. By contrast, at the 6-month follow-up assessment, the percentage of women scoring below the EPDS cutoff for a probable major depressive episode was significantly higher in the PEG group than in the routine-care group (65 percent versus 36 percent, Chi square = 3.75; P ≤ 0.05). An additional analysis demonstrated that the use of antidepressant medication during the study had no impact on the improvement in mood observed at the 6-month follow-up assessment.
Fleming et al. screened 781 primiparous women with full-term deliveries and no psychiatric history during their first 2 weeks postpartum using the EPDS (cutoff ≥ 13), the Current Experience Scale (CES, cutoff ≥ 35), and the Multiple Affect Adjective Checklist (MAACL, cutoff ≥ 21).102 Women with a positive screen (n = 142) were assigned (not randomly) to either a postpartum social support group that included both depressed and nondepressed women or a usual-care group.
Outcome measures included the EPDS and the CES at the same cutoff scores used for screening. At the 6-week and 5-month follow-up assessments, the groups did not differ significantly with respect to rates of maternal depression, and the support groups had no apparent effect on the mothers' general affective mood. However, women in the social support group had a statistically significant increase in the number of maternal-infant interactions and noted decreased infant crying compared to women in the routine-care group.
Chen et al. screened 414 women at 3 weeks postpartum using the Taiwanese BDI (cutoff ≥ 10).100 Of these, 115 women with positive screens were randomized to weekly support groups or to a routine-care group; 60 patients were available for analysis. Outcome measures included the BDI (cutoff ≥ 10), Perceived Stress Scale (PSS), and measures of interpersonal support. At the 15-week follow-up assessment, women in the intervention group had significantly lower rates of depression: 33.3 percent of the intervention group and 60.0 percent of the control group had BDI values equal to or greater than 10 (P < 0.05). The rate of perceived stress was also significantly lower in the intervention group than the control group (t-test = 3.75, P < 0.01). Finally, women receiving the intervention reported significantly more interpersonal support as measured by the Interpersonal Support Evaluation List than those in the control group (t-test = 2.81, P < 0.01).
Pharmacologic Studies. Two of the studies were psychotropic medication trials to prevent the occurrence of postpartum depression. The women were not directly screened with any instrument, but rather were included if they had a previous history of postpartum depression. The same research team conducted both of these pharmacologic trials. In the first trial, Wisner and Wheeler studied the efficacy of antidepressant treatment in women with a previous history of postpartum depression (i.e., at high risk of maternal depression but no history of psychosis or bipolar disorder).107 At-risk postpartum women (n = 23) who had had at least one episode of postpartum depression were treated in an open clinical trial with the tricyclic antidepressant nortriptyline and postpartum monitoring or with postpartum monitoring only. Outcome measures included a clinical assessment of major depression and the Inventory to Diagnose Depression scale. After 3 months, a significantly greater proportion of patients who received monitoring alone had a new episode of major depression than of those in the medication group (62.5 percent in the monitoring group versus 6.7 percent in the medication group; P = 0.0086).
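As a rough check on the reported P value, the percentages are consistent with 5 of 8 women in the monitoring-only group and 1 of 15 women in the nortriptyline group having a new episode (5/8 = 62.5 percent; 1/15 = 6.7 percent; 8 + 15 = 23). These counts are inferred from the percentages rather than stated in the text, and the test the investigators used is not named here. Assuming those counts and a Fisher exact test, a minimal Python sketch is:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in row 1 with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= observed * (1 + 1e-9))

# Assumed counts (inferred from the reported percentages, not given in the text):
# monitoring only: 5 new episodes, 3 without; nortriptyline: 1 new episode, 14 without.
p_value = fisher_exact_two_sided(5, 3, 1, 14)
print(round(p_value, 4))
```

Under these assumptions the result rounds to 0.0086, which agrees with the reported value.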
In a later Wisner et al. RCT, 56 women with a prior history of postpartum depression within the past 5 years but no depressive episode upon enrollment, as diagnosed by standardized research diagnostic criteria, were randomized to either a nortriptyline group or a placebo group immediately postpartum.108 Outcome measures of recurrence of perinatal depression included the HAM-D and Research Diagnostic Criteria (RDC). In contrast to the earlier open-label trial, the investigators reported no difference in the rate of recurrence of depression (one-fourth) between women treated with nortriptyline and those receiving placebo (23 percent versus 24 percent, respectively).
None of the studies used treatment interventions that are recognized as the gold standard treatment for major depressive illness according to current American Psychiatric Association guidelines. These guidelines specify that the gold standard include antidepressant medication plus psychotherapy.12
Only three studies had quality scores of 18 or higher: Wisner et al.,108 Armstrong et al.,98 and Dennis et al.101 All three enrolled women in the postpartum period and had a fairly intensive treatment approach consisting of weekly interventions (Wisner et al. for 20 weeks, Dennis et al. for 8 weeks, and Armstrong et al. for 6 weeks). The Wisner et al. study had a pharmacologic intervention with weekly assessments of efficacy but no psychotherapeutic intervention; by contrast, the Dennis et al. and Armstrong et al. studies had weekly psychotherapeutic interventions but no pharmacologic intervention. Interestingly, although Wisner et al. treated patients for 20 weeks postpartum with antidepressant medication, their study did not have a significant result. This finding may suggest that psychosocial support and psychotherapeutic intervention are both critical as part of a treatment plan for women with postpartum depression.
The 15 studies examined a variety of screening and treatment interventions for women identified as being at risk (sometimes at high risk) for postpartum depression. The majority of these studies focused on intervention strategies in the postpartum period; all but two dealt with a wide array of psychosocial, education, skill-building, and other mother-child behavioral activities. Generally, the more successful efforts occurred in the studies in which screening and interventions were carried out in the postpartum, not the prenatal, period. Once again, none of the studies had a treatment intervention with both psychotherapeutic and pharmacologic components that would be considered “gold standard therapy” for the treatment of major depression.
Overall, many of the studies suggest that providing the mother with some form of psychosocial program to increase maternal support or improve maternal-child interaction may decrease the rate of postpartum depression. Across the nine nonpharmacologic studies, about 20 outcomes were assessed; of these, 12 showed significant effects for the intervention group. Taking only the outcomes dealing specifically with depression, nine significant effects were reported. The two small pharmacologic trials from the same group yielded conflicting results about the impact of nortriptyline in reducing recurrence of maternal depression.
Only one study97 specifically studied low SES women—a matter of some interest to the Safe Motherhood Group. Low SES women with at least one risk factor for postpartum depression who participated in weekly prenatal survival skills classes were less likely to develop postpartum depression compared to controls. This small study suggested that increasing support and parenting skills may help to decrease postpartum depression in this particular population.
This set of studies, however, has several limitations, and it can be regarded as offering, at best, only fair evidence about the utility of screening plus prevention or treatment programs or even interventions alone. Although a variety of interventions may be helpful in treating women with or at risk of perinatal depression, the available evidence does not directly address whether screening with subsequent intervention improves outcomes. Screening, in the classic sense, implies “examination of a group of usually asymptomatic individuals to detect those with a high probability of having a given disease” (italics added; http://dictionary.reference.com/search?q=screening); this meaning can be extended to using appropriate screening or diagnostic tools within populations with known risk factors. These studies provide little guidance in answering the practical question of whether clinicians should screen all women in the perinatal period (i.e., essentially an asymptomatic population with respect to depression) for risk factors or latent depression, or whether they should screen only women who have known prior histories or risk factors for depression.
The studies are generally small, with poor generalizability (especially to the heterogeneous childbearing population of the United States). We contemplated and rejected the idea of any quantitative analyses: populations, settings, and screening and outcome measures—let alone interventions—were simply too disparate for anything but qualitative synthesis.
To overcome some of these problems in understanding the impact of programs designed to prevent the problems of perinatal depression or to mitigate the considerable deleterious effects of this disorder on mothers, infants, and families, considerably more and better research needs to be conducted. Possibly the most important issue is for future studies to enroll adequate samples of women and, if screening is the question, to screen quite large numbers of women to produce sample sizes with adequate power to detect relevant differences between treatment and control groups in later phases of these studies. Virtually all studies appeared to be underpowered to start with, and some lost participants along the way. This deficiency hampers investigators and policymakers in making sense of, or decisions based on, much of this work.
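To make the point about power concrete, the sketch below applies the standard normal-approximation formula for comparing two proportions. The 65 percent and 36 percent rates are borrowed from the Honey et al. result described earlier purely as an illustration. The two-sided alpha of 0.05 and the 80 percent power are conventional assumptions, not values taken from the studies reviewed.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per arm to detect p1 versus p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Illustrative effect size only: 65 percent versus 36 percent below the cutoff.
needed = n_per_group(0.65, 0.36)
print(needed)
```

Under these assumptions the formula calls for 43 women per group (86 in total), compared with the 45 women randomized in that trial. A smaller true difference would require considerably more.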
Moreover, a greater effort should be made, at least in US-based studies, to focus on ethnic and disadvantaged populations, such as low-income women. Even if the incidence and prevalence of perinatal depression were “evenly” spread over population groups in this country, the underfunding of health care for many (e.g., lack of insurance, poor coverage of mental health benefits in insurance plans, unavailability of publicly funded services) and the more precarious economic resources and family support for some populations means that additional attention needs to be paid to them. For example, programs may need to be designed to take lack of transportation, child care, or telephone access into account.
In addition, researchers might direct attention to several other variables that appear to be important. They include first-time versus second-time mothers, maternal comorbidities and lifestyle behaviors, family structure (make-up) and available support, and status of infant at birth (e.g., full term or not, healthy or not). Another gap may be programs intended to assist the mother-father dyad or, indeed, to assist fathers in providing the emotional or physical support needed to forestall depression in new mothers.
Ideally, researchers would employ similar screening measures with similar cutoff points so that some elements of separate studies could be compared more readily. Not all of the screening instruments used appear to be sufficiently well-targeted to perinatal depression (i.e., even if they are reliable, their validity for this purpose may be called into question). Moreover, some instruments may be relatively infeasible for use in certain populations (e.g., immigrants) or in cases in which patient self-report is important and literacy may be low. For these situations, some work to calibrate well-known instruments that have been specifically designed for this disorder and that have acceptable test properties against each other might be useful. Calibrating less well-known or well-proven instruments against some agreed-upon reference (“gold”) standard instrument in this area might also be valuable. Testing these in different settings, trying to use shorter instruments, attempting to take literacy levels into account, and in other ways improving the screening armamentarium are also important steps. In that way, investigators and clinicians can have a better selection of proven screening tools for future research or clinical practice applications.
Another element warranting more clarity is the purpose of the screening-cum-intervention effort. All appear to relate to populations of women at risk of perinatal depression (particularly postpartum depression that goes beyond “maternal blues”), but the severity level of being at risk differed in these studies. Moreover, women could have had no prior history of depression (or perinatal depression) and be at risk; alternatively, they could have had some history, especially of postpartum depression, and be, essentially, at “high” risk. These distinctions did not seem to be well or consistently described across these studies. They also have implications for the goals of the interventions themselves: for example, preventing any “first episode” of depression, mitigating the effects of a first episode that is not wholly prevented, or preventing a recurrence.
Interventions tested in the future would, ideally, be those shown to have some promise so far (e.g., as reflected in some of the studies reported here). The components of the programs should be of appropriate length and intensity, and published articles should describe them thoroughly. In addition, interventions should be consistent with current evidence-based practice standards for the treatment of major depression. Multiple studies of the same interventions, perhaps at different time periods or different settings and populations, might be helpful in completing the picture of the impact of screening and interventions on occurrence or reoccurrence of perinatal depression. Finally, outcome measures should be appropriate to the research questions and preferably selected from among the more reliable, valid, and widely used instruments. These steps might help fill the gaps in this knowledge base and permit those performing systematic reviews to compare and synthesize studies more readily.
In an effort to identify the evidence base addressing important questions on the epidemiology, screening and diagnosis, and management of perinatal depression, the Safe Motherhood Group (SMG) and the Agency for Healthcare Research and Quality (AHRQ) initially requested a feasibility study to determine whether enough high-quality evidence existed on six separate issues to support a full evidence report. After reviewing our feasibility study,24 SMG and AHRQ requested an evidence report focusing on the three key questions (KQs) covered in this review.
We applied rigorous selection criteria and assessed the quality of each study, bringing a public health perspective to an area of research that traditionally has not had this focus. Our report was limited to depressive illness without psychotic symptoms, the latter complication being much less common and much more challenging to identify and manage. We made a distinction between results involving major depression alone, a discrete clinical syndrome for which treatment is clearly indicated, and results referring to patients with either major or minor depression, for which management is less clear.
This evidence report comprises a comprehensive review of all the available research. In this final chapter, we first review the major findings pertaining to each question and the strength of overall evidence about these issues; we then present some observations and recommendations about future research.
For KQ 1, we identified 30 studies of generally moderate size that provide estimates of the prevalence of perinatal depression; 13 of these inquiries provide estimates of incidence. Studies were generally of good quality for reporting completeness and internal validity for bias; by contrast, they were of fair quality for precision and only poor quality for external validity and internal validity for confounding. In particular, the study populations were not representative of the racial and ethnic mix of the countries in which the studies were performed and especially not of the United States.
Our final best estimates of prevalence and incidence were somewhat lower than those reported in prior systematic reviews because we excluded studies that assessed depression based on self-report screens alone, which have been found to overestimate prevalence. Also, we separated out estimates of major and minor depression from estimates of major depression alone. Finally, we included more recent studies that use more precise criteria to identify major depression.
For major depression alone, our final combined point prevalence estimates ranged from 3.1 percent to 4.9 percent at different times during pregnancy and from 1.0 percent to 5.9 percent at different times during the first postpartum year. For major and minor depression, our final combined estimates of point prevalence ranged from 8.5 percent to 11.0 percent at different times during pregnancy and between 6.5 percent and 12.9 percent at different times during the first year postpartum. This nearly 2-fold higher rate suggests that approximately half of the women experience a major depressive episode and half a minor depressive episode at any given time. Confidence intervals surrounding all these estimates remain wide, suggesting that a fair amount of uncertainty remains in the combined estimates.
Fewer estimates were available for the incidence of depression. These limited data suggest that as many as 14.5 percent of pregnant women have a new episode of either major or minor depression during pregnancy, and 14.5 percent have a new episode during the first 3 months postpartum. Considering only major depression, 7.5 percent may have a new episode during pregnancy, with 6.5 percent having a new episode in the first 3 months postpartum.
Are the prevalence and incidence of depression during the perinatal period higher than the rates during nonchildbearing periods? We found three studies that measured the prevalence of major or minor depression and major depression alone for women at different times during these two periods. None of these estimates shows a statistically significant difference. Only one study20 directly compared the incidence (new onset) of perinatal depression to that of nonchildbearing women of similar age; women at 5 weeks postpartum were more than three times as likely as the comparison group to have a new episode of major or minor depression. By 6 months postpartum, this difference had disappeared. An incidence for major depression alone was not reported.
That these estimates did not appear significantly different from those of nonchildbearing women of the same age does not reduce the dramatic burden experienced by women postpartum. Indeed, these estimates, based on the best available evidence, suggest that perinatal depression, whether major or minor depression, is a very common complication of pregnancy. Furthermore, and arguably more important, after labor and delivery this dramatically common complication, rather than primarily affecting one individual, now directly affects two: mother and child.
For our analysis of the accuracy of screening tools (KQ 2), we identified 10 studies reporting test characteristics for English-language screeners. In general, studies were of fair to good quality, although external validity was only poor to fair. In particular, the study populations were nearly entirely white, so the accuracy of these screeners in nonwhite perinatal populations is not clear. A major limitation in the available evidence is the very small number of depressed patients involved, a fact that results in substantial imprecision in the point estimate of sensitivity and prevents one from reasonably determining an ideal cutoff point.
For depression during pregnancy, we found only one study reporting on screening accuracy in a population with 6 patients with major depression and 14 patients with either major or minor depression. For major depression, sensitivities for the Edinburgh Postnatal Depression Scale (EPDS) at all evaluated thresholds (12, 13, 14, 15) were 1.0, underscoring the markedly small number of depressed patients involved; specificities ranged from 0.79 (at EPDS ≥ 12) to 0.96 (at EPDS ≥ 15). For major or minor depression, sensitivity was much poorer (0.57 to 0.71); specificity remained fairly high (0.72 to 0.95).
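The imprecision that comes with six depressed patients can be illustrated with a confidence interval for a sensitivity of 6 out of 6. The sketch below uses the Wilson score interval. That choice is an assumption made for illustration, and the interval method used in the underlying study is not stated here.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% at z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# All 6 women with major depression screened positive (sensitivity = 1.0).
low, high = wilson_interval(6, 6)
print(f"sensitivity 6/6: {low:.2f} to {high:.2f}")
```

With this method the interval runs from roughly 0.61 to 1.00, so an observed sensitivity of 1.0 in six patients is compatible with a true sensitivity well below that.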
For postpartum depression also, the small number of depressed patients involved in the studies precluded identifying an optimal screener or an optimal threshold for screening. Our ability to conduct a meta-analysis of the results of different studies was limited by the use of multiple cutoffs and other differences across studies that precluded a meaningful interpretation of the results. Where we were able to combine the results, the pooled estimates did not add to what one could conclude from individual studies.
For women with major depression alone, specificity for all screeners (the Beck Depression Inventory [BDI], the Postpartum Depression Screening Scale [PDSS], and the EPDS) was relatively high. This finding suggests that a positive screen was accurate in ruling major depression in; that is, the risk that a screen with one of these instruments would be falsely positive was low. By contrast, sensitivities varied much more. The EPDS and the PDSS appeared to be more sensitive (with estimates ranging from 0.75 to 1.0 at different thresholds) than the BDI instruments (with estimates from 0.32 to 0.68), but the wide confidence intervals (CIs) overlapped nearly completely. This means that we could not say with confidence that the sensitivity estimates using the different tools were different.
The point estimates are consistent with what is reported for depression screeners in primary care settings.90 Still, the imprecision is important to clarify. If falsely missing depression (a false negative) is worse than falsely identifying it, as may be the case with this disorder, clinicians must be able to feel confident that the screen is usually positive if the disease is there and that a negative result can help rule out the illness.
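The practical weight of a false negative can be sketched by counting expected screening outcomes in a cohort. In the example below, the sensitivities of 0.32 and 0.75 are taken from the ends of the ranges quoted above. The specificity of 0.90, the prevalence of 5 percent, and the cohort of 1,000 women are hypothetical values chosen only for illustration.

```python
def screen_outcomes(sensitivity, specificity, prevalence, n=1000):
    """Expected screening outcomes in a cohort of n women."""
    cases = prevalence * n
    noncases = n - cases
    return {
        "true_pos": sensitivity * cases,
        "false_neg": (1 - sensitivity) * cases,
        "false_pos": (1 - specificity) * noncases,
        "true_neg": specificity * noncases,
    }

for sens in (0.32, 0.75):
    # Specificity and prevalence are illustrative assumptions, not report findings.
    out = screen_outcomes(sens, specificity=0.90, prevalence=0.05)
    npv = out["true_neg"] / (out["true_neg"] + out["false_neg"])
    print(f"sensitivity {sens:.2f}: missed cases per 1,000 = "
          f"{out['false_neg']:.1f}, negative predictive value = {npv:.3f}")
```

Under these assumptions, a screener with sensitivity 0.32 would miss about 34 of the 50 expected cases per 1,000 women, and one with sensitivity 0.75 would miss about 12 or 13.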
For patients with major or minor depression, results were reported for EPDS, BDI, PDSS, and the Center for Epidemiologic Studies-Depression (CES-D). Specificity estimates remained relatively high, but sensitivity results were much lower (ranging from 0.43 to 0.71) than for major depression alone. This means that the ability of the screening instrument to score women as positive for this condition when the disease is present was poorer than for major depression alone. Again, neither any particular cutoff nor any particular screening instrument performed differently from the others. No available comparators were found for primary care populations.
Our results suggest that various screening instruments can identify perinatal depression, most accurately major depression, but clinicians need to know more about the precision of individual instruments. If one assumes that the risk of a false-negative depression screen is worse than the risk of a false-positive screen, perinatal depression is a condition in which sensitivity is likely to be more important than specificity. Whether as a screen for major depression alone or for major or minor depression, specificities appear high and relatively precise. By contrast, sensitivity for identifying either category is imprecise and differs by diagnostic category. For major depression alone, point estimates are equivalent to those in primary care medical settings. For major or minor depression, however, sensitivity is quite low. At this time, these screens do not appear to be useful for identifying patients in this latter category of illness.
KQ 3 concerned issues of whether screening ultimately leads to improved patient outcomes. Although it is the most vital question from the public health perspective, it is the one with the most limited evidence. Indeed, the studies that we identified were not designed to test whether screening for depression (versus not screening) improved patient outcomes. Such a design would randomize patients to be screened or not to be screened and then compare subsequent outcomes. We found no studies designed in this way.
Instead, we made use of studies in which women were screened by formal depression screening or the presence of a risk factor associated with perinatal depression to identify those at risk of having a depressive illness; then, for those screening positive, the investigators compared the outcomes of women receiving a treatment intervention to those in a control group. This design tests whether, among women identified as at risk of depression by a screen, an intervention improves outcomes compared to the outcomes in a control group. This is an important intermediary step, but it does not directly test whether screening itself improves outcome compared to not screening. All the trials included are treatment studies that had a screening component (either a formal depression screening instrument, or other type of screen that identified women at risk of a depressive illness) but did not have diagnostic confirmation of depression.
We attempted to synthesize the results of the included studies quantitatively, but the study methods (screening instruments, type of intervention, intensity of intervention, outcomes measured) were so heterogeneous that a meta-analytic synthesis would not be meaningful. We also tried to compare effect sizes in an exploratory analysis of the various studies, but the data necessary to compute these were not available.
For patients whose screening results identified them as at risk of perinatal depression and for whom a subsequent intervention was provided, we identified 15 studies. Four small prenatal studies involved various psychosocial interventions. Quality was poor for three of these studies and fair for one. Overall, the effects of the interventions in these studies were not consistently superior to those in the control groups.
The 11 postpartum studies were of overall fair quality and had larger sample sizes than the prenatal trials. Study populations reflected only a limited racial and ethnic mix, and both external validity and the power to demonstrate statistically significant differences were generally poor. Again, screening tools and interventions varied considerably; the latter involved both psychosocial and pharmaceutical interventions.
Results were mixed. Of the nine trials that employed a psychosocial intervention, six studies98–101, 103, 104 reported significant benefit for depression outcomes in the experimental group compared to those in the control group. The one RCT involving pharmacologic intervention did not show benefit relative to the control group.108 Overall, the evidence available is not sufficient to draw conclusions about this key question. These results, although limited, do suggest that providing some form of psychosocial support to pregnant women at risk of having a depressive illness may decrease depressive symptoms.
The available research suggests that depression is one of the more common complications of the prenatal and postpartum periods and that fairly accurate and feasible screening measures are available. The prenatal or postpartum periods are clearly not times for nonpsychiatric clinicians to ignore depression screening, which is routinely recommended for patients seen in primary care settings.110, 111
Given the specifics of the course of a depressive illness with onset during the perinatal period, including the severe physiologic and psychological challenges unique to this period that complicate the identification and management of perinatal depression, one might expect this topic to have attracted a substantial body of high-quality research. We were surprised by the paucity of such evidence in this area. If one assumes that perinatal depression is a significant mental health and public health problem, then larger scale studies are needed involving each of these domains. The small number and small size of relevant studies are not adequate to guide national policy.
Reflecting on the three key questions addressed in this report, we have concluded generally that the level of research warrants both improvement and expansion. The three results chapters discuss the limitations and gaps in these areas in more detail. We summarize here our suggestions for additional research efforts for the future.
For KQ 1, prevalence studies need to account better for the racial and ethnic mix of perinatal depression in the US population. We do not have good evidence about whether and, if so, how perinatal depression rates differ among various ethnic groups. The absence of information on nonwhite populations was dramatic. Better understanding any racial and ethnic variations could help clinicians know where to target screening programs and researchers know where to target studies on screening tools, and it could help researchers clarify the need for more nationally representative perinatal depression samples. Furthermore, researchers need to clarify whether the incidence of perinatal depression is greater than the incidence of depression in nonchildbearing women of similar ages.
For KQ 2, the quality grades point to several areas in which improvements in study design and conduct are needed. In particular, future studies on the test characteristics of screeners must be designed with sample size estimates that take prevalence into account and that project a reasonably precise estimate of sensitivity for the particular illness. Moreover, samples should more closely mirror the target population; specifically, subsequent studies need to provide a more representative racial and ethnic mix. In addition, studies should incorporate a range of other demographic variables that could influence screening performance, such as socioeconomic status measures, and assess the screening tools in these subpopulations.
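One common way to build prevalence into a sample size estimate is to compute the number of depressed women needed for a target precision on sensitivity and then divide by the expected prevalence. The sketch below uses the normal-approximation formula. The expected sensitivity of 0.85, the half-width of 0.10, and the prevalence of 5 percent are hypothetical inputs, not recommendations from this report.

```python
from math import ceil

def sample_size_for_sensitivity(expected_sens, half_width, prevalence, z=1.96):
    """Cases and total women needed to estimate sensitivity within +/- half_width."""
    cases_needed = z**2 * expected_sens * (1 - expected_sens) / half_width**2
    total_needed = cases_needed / prevalence  # inflate by 1 / prevalence
    return ceil(cases_needed), ceil(total_needed)

# Hypothetical planning values for illustration only.
cases, total = sample_size_for_sensitivity(0.85, 0.10, 0.05)
print(cases, total)
```

With these inputs, about 49 depressed women would be needed, which implies screening roughly 980 women. This is far more than the handful of depressed patients in the studies reviewed.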
Furthermore, as Beck and Gable did,71 future research should continue to assess and directly compare multiple screening instruments. This design would provide a head-to-head comparison to allow an evaluation of which screening instrument is more accurate in the setting in which the investigations are carried out. Moreover, studies evaluating the cost-effectiveness of screening, specifically assessing the relative costs of false-negative and false-positive designation, the degree of provider burden, and patient acceptability, are needed to provide insights on how to consider target sensitivity and specificity when attempting to maximize cost-effectiveness.
Diagnosis is another area of concern. Subsequent studies should carefully consider whether to target major depression alone, for which beneficial treatments clearly exist, or the combined category that includes minor depression, a heterogeneous group for which treatment benefit is unclear. Given that the results suggest that available screening tools identify major depression alone more accurately, and noting that the general benefit of interventions is more apparent for major depression alone, we believe that an evidence-based public health perspective recommends targeting major depression alone.
Timing is another factor of future studies deserving more thought. The issue here involves both the need for more epidemiology to confirm prevalence rates at different times as well as the need to confirm what time point(s) would identify the greatest number of depressed women. The bulk of the few screening studies we identified had been conducted in the first 3 months postpartum. Our best estimates of prevalence suggest that depression may remain high for several more months.
More studies are needed to better delineate periods of peak prevalence and incidence, to include not just 3 months but also 6 weeks, 6 months, and 12 months, and subsequent screening studies need to consider testing properties of screening at these later time periods. The very small number of adequate studies currently available hampers plans for screening and intervention programs because the best time for screening, and hence the best clinic location, is not clear. If peak prevalence and incidence occur within the first 6 weeks, the obstetrics clinic is a prime place to target resources for such a program. If, however, peaks occur after this time, most postpartum women will have completed follow-up care with an obstetrician, so programs in an obstetrics clinic may be less helpful. In this case, programs targeting new mothers in family medicine, internal medicine, or pediatric clinics might be more effective.
For KQ 3, several similar or related issues emerged as well. First, studies addressing the relationship between screening and outcome need to recruit and retain sample sizes that are large enough to yield adequate power to detect relevant differences. Second, screening and outcome studies must include populations with a racial and ethnic mix that is more representative of the US population than the work we have seen to date. Third, interventions involved should be more consistent with what we know to be evidence-based treatments for depression,12 i.e., antidepressant medications112 and/or psychotherapies such as cognitive behavioral therapy113 or interpersonal psychotherapy.114
The type of screening measures used henceforth is another major issue. Of the three KQ 3 studies rated as good,98, 101, 108 only Dennis and colleagues used a depression screener (EPDS).101 Researchers should consider developing and using standardized screening measures, and similar cutoff points, so that some elements of separate studies could be compared more readily. Screening tools with the best supporting evidence would seem to be the best candidates. While the evidence base remains quite limited and any conclusions preliminary, at this time those instruments would appear to be either the EPDS or the PDSS. For major depression alone, an EPDS cutoff of ≥ 13 or a PDSS cutoff of ≥ 81 is reasonably supported by the evidence. For major or minor depression, we found the results too inconclusive to make even a preliminary recommendation.
Finally, studies should be designed to address whether the screening process itself leads to better access to proven treatment and improved outcome relative to usual care. We support additional research on interventions per se, but we conclude that important questions remain about the impact of the screening element. Reviewing studies that used screening as a means of identifying women potentially at high risk and enrolling them in interventional studies is not a sufficient approach to answering issues about the effectiveness of screening.
Bipolar disorder - a type of mood disorder characterized by both (1) one or more major depressive episodes and (2) either one or more manic or mixed episodes (Bipolar I) or hypomanic episodes (Bipolar II). The disorder may or may not be accompanied by psychotic symptoms. In community samples, the prevalence of bipolar disorder (approximately 1 percent) is lower than the prevalence of major depressive disorder (at least 6 percent). Given that management of bipolar disorder is notably different from that of major depressive disorder, making such a diagnostic distinction is critical.
External validity - the extent to which a study's conclusions can be applied to populations and settings outside those of the study itself.
Incidence - the percentage of the population with an illness episode that begins within a given period of time (e.g., during pregnancy or within the first 3 months following delivery).
Internal validity - the extent to which a study is appropriately designed and conducted to measure what it is intended to measure.
Major depressive disorder - a type of mood disorder characterized by one or more major depressive episodes. The Diagnostic and Statistical Manual, version III (DSM-III) defines a major depressive episode as a period of at least 2 weeks during which an individual experiences daily disturbance in mood (intense feelings of sadness or loss of interest in activities that are usually pleasurable) and at least four of eight symptoms: (1) too much or too little sleep, (2) appetite or weight disturbance, (3) psychomotor agitation or retardation, (4) loss of energy, (5) feelings of worthlessness or excessive guilt, (6) problems with concentration or indecisiveness, (7) loss of interest in sex, and (8) recurrent suicidal thoughts or attempts. DSM-IV changed these criteria to the following: (1) symptoms must be present most of the day and nearly every day during the episode, (2) clinically significant distress or impairment in functioning must be present, (3) the syndrome must not be the result of the direct physiologic effects of a substance or a general medical condition, (4) major depressive disorder is still diagnosed after an acute grief reaction if the syndrome lasts for more than 2 months.
Major depressive disorder is not diagnosed if the syndrome is attributable to an acute grief reaction or a nonaffective psychotic condition such as schizophrenia. In addition, major depressive disorder is not diagnosed if there is a history of a manic, hypomanic, or mixed episode.
Maternity blues - a subthreshold cluster of depressive symptoms commonly described in up to 50 percent of postpartum women. This transient condition does not require an intervention.
Meta-analysis - a quantitative approach for systematically combining evidence from multiple previous research studies on a particular parameter or association to arrive at a conclusion about the body of research on that parameter or association.
Meta-regression - a statistical analysis of the association between one or more study characteristics and the observed magnitude of effect.
Minor depressive disorder (also known as minor depression) - a subthreshold diagnosis with a variety of definitions, but in general seen as one or more episodes of depression lasting 2 weeks or more but with fewer symptoms than required for a diagnosis of major depressive disorder.
Period prevalence - the percentage of the population with depression over a period of time (e.g., during pregnancy or from delivery to the end of the first 3 months postpartum).
Perinatal depression - a condition encompassing major and minor depressive episodes that occur during pregnancy (prenatal) or within the first 12 months following delivery (postpartum).
Point prevalence - the percentage of the population with a condition at a given point in time (e.g., at 24 weeks gestation or 9 weeks postpartum).
Postpartum - for the purposes of this review, the period from parturition to 12 months after delivery.
Postpartum depression - according to DSM-IV, a specific type of major depressive disorder with onset of a major depressive episode within 4 weeks postpartum.
Postpartum psychosis - also known as puerperal psychosis, this condition is a severe and rare postpartum disorder, affecting 1 to 2 per 1,000 births. Women with postpartum psychosis present with new onset of delusions or prominent hallucinations. More than half of these episodes meet the criteria for major depressive disorder, and many women ultimately prove to have bipolar illness. Management of postpartum psychosis substantially differs from the much more common presentation of major depressive disorder with postpartum onset.
Power (statistical power) - the probability of detecting as “statistically significant” a postulated level of effect.
Precision - a measure of how close an estimator is expected to be to the true value of a parameter. Precision is related to the standard error of the estimator; less precision is reflected by a larger standard error.
Prenatal - the period of pregnancy from conception to parturition.
Puerperium - the 6-week period following delivery.
Reference standard (also known as gold standard) - the diagnostic assessment against which the screening test is compared to gauge the accuracy of the screening test. The reference standard determines the actual presence of disease. For psychiatric illness, the reference standard is often a clinical assessment by a mental health professional or a structured or semi-structured diagnostic interview.
Screen (also screening) - the use of a measure or test, often a formal instrument or tool, to classify an individual with respect to her likelihood of having a particular disorder. A screen itself does not diagnose the illness; those screening positive require a subsequent diagnostic assessment to confirm the presence of the disease.
Sensitivity - the ability of a test to identify correctly those who have a condition, computed as the percentage of true positive values correctly predicted by the test. A sensitive test identifies few false-negative cases.
Specificity - the ability of a test to identify correctly those who do not have a condition, computed as the percentage of true negative values correctly predicted by the test. A specific test identifies few false-positive cases.
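As an illustration of the sensitivity and specificity definitions above, the following minimal sketch computes both measures (and the negative predictive value, NPV) from a 2 x 2 comparison of a screening test against a reference standard. The class name, function names, and counts are hypothetical and chosen only for illustration; they are not data from any study in this review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenResults:
    """Counts from comparing a screen with a reference standard (hypothetical)."""
    true_pos: int   # screen positive, condition present
    false_neg: int  # screen negative, condition present
    true_neg: int   # screen negative, condition absent
    false_pos: int  # screen positive, condition absent


def sensitivity(r: ScreenResults) -> float:
    # Proportion of women with the condition whom the screen identifies.
    return r.true_pos / (r.true_pos + r.false_neg)


def specificity(r: ScreenResults) -> float:
    # Proportion of women without the condition whom the screen clears.
    return r.true_neg / (r.true_neg + r.false_pos)


def negative_predictive_value(r: ScreenResults) -> float:
    # Proportion of negative screens that are truly free of the condition.
    return r.true_neg / (r.true_neg + r.false_neg)


# Illustrative counts only: 200 women, 20 of whom have the condition.
example = ScreenResults(true_pos=18, false_neg=2, true_neg=160, false_pos=20)

print(f"Sensitivity: {sensitivity(example):.1%}")
print(f"Specificity: {specificity(example):.1%}")
print(f"NPV:         {negative_predictive_value(example):.1%}")
```

With these assumed counts, only 2 of the 20 women with the condition are missed, so sensitivity is high, while the 20 false positives among 180 unaffected women lower the specificity.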
Database: MEDLINE <1966 to March Week 3 2004>
Search Strategy:
1 exp Puerperal Disorders/ (16527)
2 exp Depression/ (32747)
3 exp Depressive Disorder/ (42005)
4 2 or 3 (73267)
5 1 and 4 (1452)
6 exp Depression, Postpartum/ or perinatal depression.mp. (753)
7 5 or 6 (1467)
13 limit 7 to (human and english language) (1299)
CINAHL used these terms as well.
PsycINFO has “Depression, Postpartum” as a Major Descriptor that yields 379.
Sociofile indexes 105 records to “Postpartum Depression”.
For Key Question 1, the following terms were used:
20 exp Natural History/ (8432)
21 8 and 20 (0)
When “Natural History” yielded no results, the following terms were used:
22 exp Cohort Studies/ (466831)
23 8 and 22 (112)
24 exp Longitudinal Studies/ (438062)
25 8 and 24 (101)
26 23 or 25 (134)
CINAHL (using similar terms) = 35
PsycINFO (natural history, cohort, longitudinal) = 65
Sociofile (natural history, cohort, longitudinal) = 20
Total from all databases for Key Question 1 = 254
After duplicates, book chapters, foreign language articles and dissertations were removed, the total unduplicated count for KQ1 = 210.
For Key Question 2, Incidence, the following terms were used:
MEDLINE
19 exp INCIDENCE/ (76679)
20 8 and 19 (31)
CINAHL (Incidence) = 7
PsycINFO (Incidence) = 23
Sociofile (Incidence) = 1
Total file = 62, minus duplications, dissertations, etc = 46
For Key Question 3, Risk, the following terms were used:
16 exp Risk Factors/ (221767)
17 8 and 16 (153)
CINAHL (Risk Factors) = 32
PsycINFO (risk) = 59
Sociofile (risk) = 11
Total from all databases for Key Question 3 = 255
After duplicates, book chapters, foreign language articles and dissertations were removed, the total unduplicated count for KQ3 = 204.
For Key Question 4, Therapies, the following terms were used:
MEDLINE
12 treatment.mp. or exp Therapeutics/ (2537613)
14 8 and 12 (513)
CINAHL (Treatment) = 90
PsycINFO (Treatment) = 91
Sociofile (Treatment) = 5
Total file = 699, minus duplications, dissertations, etc = 485
For Key Questions 5 and 6, Screening Accuracy and Screening Barriers, searches focused on “screening,” with the total pool given to investigators for finer sorting between the two questions.
MEDLINE
9 exp mass screening/ (62902)
10 8 and 9 (67)
CINAHL (screening) = 25
PsycINFO (screening) = 28
Sociofile (screening) = 1
Total from all databases for Key Questions 5 & 6 = 121
After duplicates, book chapters, foreign language articles and dissertations were removed, the total unduplicated count for KQ 5&6 = 96.
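The per-database yields reported above can be tallied as a consistency check. The sketch below is illustrative only: it uses the counts and the key question labels exactly as they appear in this search appendix, and the variable names are hypothetical.

```python
# Records retrieved per database, as listed in the search strategy above.
yields = {
    "KQ 1": {"MEDLINE": 134, "CINAHL": 35, "PsycINFO": 65, "Sociofile": 20},
    "KQ 2": {"MEDLINE": 31, "CINAHL": 7, "PsycINFO": 23, "Sociofile": 1},
    "KQ 3": {"MEDLINE": 153, "CINAHL": 32, "PsycINFO": 59, "Sociofile": 11},
    "KQ 4": {"MEDLINE": 513, "CINAHL": 90, "PsycINFO": 91, "Sociofile": 5},
    "KQ 5 & 6": {"MEDLINE": 67, "CINAHL": 25, "PsycINFO": 28, "Sociofile": 1},
}

# Totals and unduplicated counts as reported in the appendix.
reported_total = {"KQ 1": 254, "KQ 2": 62, "KQ 3": 255, "KQ 4": 699, "KQ 5 & 6": 121}
unduplicated = {"KQ 1": 210, "KQ 2": 46, "KQ 3": 204, "KQ 4": 485, "KQ 5 & 6": 96}

# Sum the four database yields for each key question.
computed_total = {kq: sum(by_db.values()) for kq, by_db in yields.items()}

# Records dropped as duplicates, book chapters, dissertations, and so on.
removed = {kq: computed_total[kq] - unduplicated[kq] for kq in yields}

for kq in yields:
    status = "matches" if computed_total[kq] == reported_total[kq] else "differs from"
    print(f"{kq}: total {computed_total[kq]} ({status} reported), "
          f"{removed[kq]} removed, {unduplicated[kq]} retained")
```

Each computed total agrees with the total reported for that search, and the difference between the total and the unduplicated count gives the number of records removed.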

| Abbreviation | Definition |
| --- | --- |
| Adj | adjusted |
| B | Bedford |
| BDI | Beck Depression Inventory |
| BDI-II | Beck Depression Inventory - II |
| C | Catego |
| CCEI | Crown-Crisp Experiential Index |
| CES | Current Experience Scale |
| CES-D | Center for Epidemiological Studies - Depression Scale |
| CI | confidence interval |
| CIDI-A | Composite International Diagnostic Interview - Auto |
| Dept | department |
| DIS | diagnostic interview schedule |
| DMC | Dyadic Mutuality Code |
| DSM-III | Diagnostic and Statistical Manual for Mental Disorders, Third Edition |
| DSM-III-R | Diagnostic and Statistical Manual for Mental Disorders, Third Edition - Revised |
| DSM-IV | Diagnostic and Statistical Manual for Mental Disorders, Fourth Edition |
| dx | diagnosis |
| EPDS | Edinburgh Postnatal Depression Scale |
| GA | gestational age |
| GHQ | General Health Questionnaire |
| GHQ-D | General Health Questionnaire - Depression |
| GP | general practitioner |
| GP/psych | general practitioner/psychiatrist |
| HDRS | Hamilton Depression Rating Scale |
| HMO | health maintenance organization |
| HOME | Home Observation for Measurement of Environment |
| hr(s) | hour(s) |
| HS | high school |
| ICD-9 | International Classification of Diseases, Ninth Edition |
| ICD-10 | International Classification of Diseases, Tenth Edition |
| IDS | Inventory of Depressive Symptomatology |
| LQ | Leverton Questionnaire |
| MAACL | Multiple Affect Adjective Check List |
| MADRS | Montgomery and Asberg Depression Rating Scale |
| MDE | major depressive episode |
| MINI | Mini International Neuropsychiatric Interview |
| MINI-V4.4 | Mini International Neuropsychiatric Interview, Version 4.4 |
| mo(s) | month(s) |
| NA | not applicable |
| NICU | Neonatal Intensive Care Unit |
| No. | number |
| NPV | negative predictive value |
| NR | not reported |
| NS | not significant |
| Ob-Gyn | obstetrics and gynecology |
| OR | odds ratio |
| PAS | Psychiatric Assessment Schedule |
| PDSS | Postpartum Depression Screening Scale |
| PEG | Psycho Educational Group |
| PP | Postpartum |
| PPG | Postpartum Guidelines |
| PSE | Present State Examination |
| PSE-ID | Present State Examination - Index of Definition |
| RCT | randomized controlled trials |
| RDC | research diagnostic criteria |
| SADS | Schedule for Affective Disorders and Schizophrenia |
| SADS-C | Schedule for Affective Disorders and Schizophrenia - Change version |
| SADS-L | Schedule for Affective Disorders and Schizophrenia - Long |
| SCAN | Schedules for Clinical Assessment in Neuropsychiatry |
| SCID | Structured Clinical Interview for DSM-IV |
| SCID-German | Structured Clinical Interview for DSM-IV - German |
| SCID-NP | Structured Clinical Interview for DSM-III-R - non-patient |
| SCLR-90 | Symptom checklist Revised - 1990 |
| SD | standard deviation |
| Sensi | sensitivity |
| Speci | specificity |
| SIDS | Sudden Infant Death Syndrome |
| SPI | Standardized Psychiatric Interview |
| SRQ | self-reported questionnaire |
| TSH | thyroid stimulating hormone |
| UK | United Kingdom |
| Univ. | University |
| USA | United States of America |
| vs. | versus |
| wk(s) | week(s) |
| yr(s) | year(s) |
This study was supported by Contract 290-02-0016 from the Agency for Healthcare Research and Quality (AHRQ), Task Order No. 4. We acknowledge the continuing support of Kenneth Fink, MD, MGA, MPH, Director of the AHRQ Evidence-Based Practice Center (EPC) Program, and Marian James, PhD, the AHRQ Task Order Officer for this project.
The investigators deeply appreciate the considerable support, commitment, and contributions of the EPC team staff at RTI International and the University of North Carolina (UNC). From UNC, we thank EPC Co-Director, Timothy S. Carey, MD, MPH; EPC Literature Search Specialist, B. Lynn Whitener, PhD; and Research Assistant Leah Randolph, MA. We also express our gratitude to Loraine Monroe, EPC word processing specialist, Debra Bost, EPC editor, and Kathleen Mohar, Manager, Publications Specialist Group at RTI International.
We extend our appreciation to the members of our Technical Expert Advisory Group (TEAG), who provided advice and input during our research process. The RTI-UNC EPC team solicited the views of TEAG members from the beginning of the project. TEAG members also provided insights into and reactions to work in progress and advice on substantive issues and overlooked areas of research. TEAG members participated in refining the analytic framework and key questions and discussing the preliminary assessment of the literature, including inclusion/exclusion criteria and methods for data synthesis. The TEAG was both a substantive resource and a “sounding board” throughout the study. It was also the body from which expertise was formally sought at several junctures. TEAG members are listed below:
Jeffrey Kuller, MD
Associate Professor
Division of Maternal-Fetal Medicine
Duke University Medical Center
Michael W. O'Hara, PhD
Professor
University of Iowa
Susan F. Meikle, MD, MSPH
Center for Outcomes and Evidence
Agency for Healthcare Research and Quality
Katherine L. Wisner, MD, MS
Director
Women's Behavioral HealthCARE
Professor of Psychiatry, Obstetrics and Gynecology and Reproductive Sciences and Epidemiology
University of Pittsburgh Medical Center
We gratefully acknowledge the following individuals who reviewed the initial draft of this report and provided us with constructive feedback. External reviewers comprised clinicians, researchers, representatives of professional societies, and potential users of the report. We would also like to extend our appreciation to David Atkins, MD, MPH from AHRQ for contributing peer review comments. Our peer review panel also includes all members of the TEAG. Peer review was a separate duty for these individuals and not part of their commitment as TEAG members. All are active professionals in the field. The peer reviewers were asked to provide comments on the content, structure, and format of the evidence report and to complete a checklist. The peer reviewers' comments and suggestions formed the basis of our revisions to the evidence report. Acknowledgments are made with the explicit statement that this does not constitute endorsement of the report.
Individuals
Cheryl Beck, DNSc, CNM, FAAN
Professor
School of Nursing
University of Connecticut
Diana Dell, MD, FACOG
Women's Behavioral Health Program
Department of Obstetrics and Gynecology
Duke University Medical Center
Judith Lumley, MD
Centre for Mother's and Children's Health
La Trobe University
Carlton, Victoria, Australia
Organizations
Shoshana Bennett, PhD
Postpartum Support International
Janet Chapin, RN, MPH
American College of Obstetricians and Gynecologists
Ruth Johnson, CNM
American College of Nurse-Midwives
Marlene Freeman, MD
American Psychiatric Association
Gwen Gjerdingen, MD, MS
American Academy of Family Physicians
Sheila M. Marcus, MD
American Psychiatric Association
Laura J. Miller, MD
American Psychiatric Association
Darrel A. Regier, MD, MPH
American Psychiatric Association
Kimberly A. Yonkers, MD
American Psychiatric Association