Myers ER, Aubuchon-Endsley N, Bastian LA, et al. Efficacy and Safety of Screening for Postpartum Depression [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013 Apr. (Comparative Effectiveness Reviews, No. 106.)

  • This publication is provided for historical reference only and the information may be out of date.

Discussion

Key Findings and Strength of Evidence

In this comparative effectiveness review (CER), we reviewed 40 unique studies represented by 45 publications that evaluated tools for screening for postpartum depression, risk factors for postpartum depression, and factors influencing the effectiveness of screening for postpartum depression. The available evidence did not allow us to draw any conclusions about the balance of benefits and harms of screening specifically for postpartum depression, or whether specific tools or strategies would result in a more favorable balance.

KQ 1. Performance Characteristics of Screening Instruments

Although the included studies varied widely in country, language, setting, and timing of testing, estimates for both sensitivity and specificity were in the 80–90 percent range for most of the screening tests for which there was evidence. As expected, there was an inverse correlation between sensitivity and specificity: increased sensitivity was associated with decreased specificity when the threshold for an abnormal screening test was varied both within and between studies.

Multiple studies were available only for the Edinburgh Postnatal Depression Scale (EPDS) and the Postpartum Depression Screening Scale (PDSS). Although heterogeneity in both the clinical characteristics of the population being screened and the threshold used precluded quantitative synthesis, the range of observed sensitivity and specificity for both of these tests fell within the 80–90 percent range. In the two studies that directly compared these two instruments, confidence intervals (CIs) for both sensitivity and specificity overlapped. There were also four studies for the Beck Depression Inventory (BDI), but different versions of the test were used. There were two studies of the “two-question” screen, both of which found a sensitivity of 100 percent if the response to either question were “yes,” but with markedly lower specificities than other tests (45.5% and 65.7%).

One Hungarian study of the 24-item Leverton Questionnaire reported sensitivity of 95.2 percent (95% CI, 90.4 to 98.1%) and specificity of 91.3 percent (95% CI, 88.4 to 93.7%). We did not identify any confirmatory studies in a U.S. setting.

Table 14 summarizes the strength of evidence for each screening test reviewed.

Table 14. Strength-of-evidence domains for test characteristics of screening tests for postpartum depression.

The probability of a false-negative or false-positive test result is a function of test sensitivity, specificity, and the prevalence of the underlying disorder. Table 15 illustrates the interaction of these three parameters, using the 80–90 percent range for sensitivity and specificity observed for most of the studies in our review. In the 2005 AHRQ evidence report,2,3 the estimated point prevalence of major depression at various points in the first 12 months after delivery was in the 4–8 percent range; the prevalence in the majority of studies included in this review was in the 10–20 percent range. There are approximately 4,000,000 deliveries annually in the United States.128 Table 15 shows the effect of prevalence, sensitivity, and specificity on the estimated annual number of true positives, false positives, and false negatives if all postpartum women are screened once during the postpartum period. It is clear from these numbers that, although a 10-percent difference in either sensitivity or specificity may appear relatively small, there are significant differences in both the number of missed diagnoses and the number of false positives across this range. Even at a relatively high prevalence, decreasing specificity from 90 to 80 percent results in over 300,000 additional false positives annually. Even if false-positive results have no individual harms, this would either place a substantial strain on existing resources for evaluation of women with possible depression or require a substantial investment in additional resources. (The implications of this tradeoff if screening is repeated throughout the postpartum year are discussed below.)
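The arithmetic behind this kind of table can be sketched in a few lines. The following is a minimal illustration, not the calculation used to build Table 15: the function name `screening_counts` and the prevalence grid are assumptions chosen for illustration, while the 4,000,000 deliveries and the 80–90 percent range come from the text. At 20 percent prevalence, moving specificity from 90 to 80 percent adds 4,000,000 × 0.8 × 0.1 = 320,000 false positives.

```python
def screening_counts(population, prevalence, sensitivity, specificity):
    """Expected results of screening every woman in `population` once."""
    cases = population * prevalence
    non_cases = population - cases
    return {
        "true_positives": cases * sensitivity,
        "false_negatives": cases * (1 - sensitivity),
        "false_positives": non_cases * (1 - specificity),
    }


DELIVERIES = 4_000_000  # approximate annual U.S. deliveries cited in the text

# Illustrative grid only; these are not the rows of Table 15.
for prevalence in (0.05, 0.10, 0.20):
    for sensitivity, specificity in ((0.80, 0.80), (0.90, 0.90)):
        counts = screening_counts(DELIVERIES, prevalence, sensitivity, specificity)
        print(
            f"prev={prevalence:.0%} sens={sensitivity:.0%} spec={specificity:.0%} "
            + " ".join(f"{name}={value:,.0f}" for name, value in counts.items())
        )
```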

Table 15. Effect of prevalence of major depression on annual expected true positives, false positives, and false negatives in the United States at varying levels of sensitivity and specificity assuming a one-time postpartum screen.

We did not identify any studies that compared the ability of individual items in specific instruments to correctly identify particular signs or symptoms of depression. One study found moderate agreement between the suicidal ideation item of the EPDS and a diagnostic instrument, but suicidal ideation was not significantly associated with any outcomes, including response to therapy. Another study compared prevalence of suicidal ideation based on the EPDS to another scale, the MOODS-SR. Prevalence of suicidal ideation was approximately twice as high on the EPDS, but the investigators did not formally compare agreement between the two or compare either to a reference standard. In this study, not surprisingly, suicidal ideation on the EPDS was significantly associated with a subsequent diagnosis of major depression.

KQ 2. Effect of Individual Factors on Screening Performance

Table 16 summarizes the strength of evidence for the individual factors identified in the included studies. Women with a history of previous psychiatric disorders, particularly mood disorders, and women in a poor-quality relationship or with low levels of social support, are at higher risk for postpartum depression. Although the heterogeneity in populations and instruments used to measure these domains precluded quantitative synthesis, the results were consistent across studies, with relatively large odds ratios of 2.0 or more, and were almost always statistically significant in multivariate analyses. Although strength of evidence for some individual risk factors within these broad categories was low (primarily based on single studies or wide CIs), the overall consistency leads to an assessment of moderate strength of evidence.

Table 16. Strength-of-evidence domains for associations with patient characteristics and risk of postpartum depression.

Chronic medical conditions and adverse pregnancy outcomes were also consistently associated with postpartum depression, but the smaller number of studies assessing these factors led to a low strength of evidence rating. With the exception of unemployment, there was insufficient evidence to assess the association between other maternal demographic factors and postpartum depression.

The majority of these factors are consistent predictors of postpartum depression in earlier studies included in other reviews,4 and it is possible that including older studies would have raised the overall strength of evidence based on greater consistency or precision. However, given that there is evidence that temporal trends in the methods used to classify subjects as depressed or nondepressed affect study results,3 this is not at all certain.

The purpose of our review of this literature was ultimately not to assess whether a given risk factor is or is not associated with postpartum depression, but whether screening women with the risk factor results in better test performance—and even more importantly—better clinical outcomes compared to screening women without the risk factor. We did not identify any studies (even observational studies) that made this direct comparison. This means that, even including additional studies, the strength of evidence that screening based on risk factors might improve performance would be moderate at best.

The potential clinical impact of better estimates of the association between a given risk factor (or group of factors) and postpartum depression is dependent not only on the strength of the association (as measured by the relative risk or odds ratio), but also on the baseline risk of postpartum depression and the prevalence of the risk factor—a common risk factor might result in a clinically significant increase in absolute risk even at low to moderate levels of increased relative risk. Given an estimate of the relative risk, the prevalence of the risk factor, and the incidence of postpartum depression, it is possible to estimate the absolute difference in incidence between those with and without the risk factor. This in turn would allow an estimation of how test characteristics, particularly positive and negative predictive value, would change if screening were conditional on the presence or absence of the risk factor. However, this estimate would again be indirect at best, and would require confirmation from more direct studies.
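As a rough sketch of the estimate described above, the overall incidence can be split into the risk among women with and without the risk factor once a relative risk and the prevalence of the risk factor are given. The function name and the example inputs (10 percent overall incidence, a risk factor present in 25 percent of women, relative risk of 2.0) are hypothetical values for illustration and are not estimates from this review; the sketch also treats the reported odds ratios as if they were relative risks, which is only an approximation.

```python
def risk_by_exposure(overall_incidence, risk_factor_prevalence, relative_risk):
    """Split overall incidence into risk with and without the risk factor.

    Uses: overall = p * RR * r0 + (1 - p) * r0, solved for r0 (risk without
    the factor), where p is the prevalence of the risk factor.
    """
    unexposed = overall_incidence / (
        1 + risk_factor_prevalence * (relative_risk - 1)
    )
    exposed = relative_risk * unexposed
    return exposed, unexposed


# Hypothetical inputs for illustration only.
exposed, unexposed = risk_by_exposure(
    overall_incidence=0.10, risk_factor_prevalence=0.25, relative_risk=2.0
)
print(f"risk with factor:    {exposed:.3f}")
print(f"risk without factor: {unexposed:.3f}")
print(f"absolute difference: {exposed - unexposed:.3f}")
```

The two risks can then serve as pretest probabilities when estimating predictive values for screening conditional on the risk factor.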

We did not identify any studies meeting our inclusion criteria that evaluated a risk prediction instrument (analogous to the use of risk prediction instruments such as the Gail model for breast cancer risk, which is used as a tool for deciding on both timing of screening and type of test).129,130 Multivariate predictive models can be characterized in terms of sensitivity and specificity. For screening for postpartum depression, a predictive model could be used to identify women with a higher pretest probability of depression (which in turn would improve positive predictive value). Alternatively, the results of a screening instrument could be incorporated into the model itself.
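The link between pretest probability and predictive value follows directly from Bayes' rule. The sketch below is illustrative only: `predictive_values` is a hypothetical helper, the sensitivity and specificity of 85 percent are assumed mid-range values, and the two pretest probabilities are arbitrary examples of a lower-risk and a higher-risk group.

```python
def predictive_values(pretest_probability, sensitivity, specificity):
    """Return (PPV, NPV) for a test applied at a given pretest probability."""
    p = pretest_probability
    true_pos = p * sensitivity
    false_pos = (1 - p) * (1 - specificity)
    true_neg = (1 - p) * specificity
    false_neg = p * (1 - sensitivity)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumed test characteristics and pretest probabilities, for illustration.
for pretest in (0.05, 0.16):
    ppv, npv = predictive_values(pretest, sensitivity=0.85, specificity=0.85)
    print(f"pretest={pretest:.0%}  PPV={ppv:.1%}  NPV={npv:.1%}")
```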

KQ 3. Effect of Testing Variables on Screening Performance

We identified only two studies that provided estimates of test performance based on timing, and the evidence was insufficient to assess whether the timing of screening relative to delivery affects sensitivity or specificity for any screening instrument. In one study judged to be at high risk of bias, test characteristics for four different screening instruments were similar when measured in the first 8 weeks after delivery compared with 2–6 months after delivery. We did not identify any studies directly comparing screening instrument performance across settings or type of provider (Table 17).

Table 17. Strength-of-evidence domains for the effect of varying timing on screening for postpartum depression.

KQ 4. Comparative Benefits of Screening and KQ 5. Comparative Harms of Screening

We identified some evidence of benefit to screening compared with no screening or usual care, either through identifying higher risk women prior to delivery and implementing primary preventive strategies, or through screening and referral for treatment. Screening led to decreases in depressive symptoms as measured by repeated administration of the screening instruments themselves (low to moderate strength of evidence, with the strength of evidence from consistent results weakened because of poor to fair study quality and imprecise estimates), and improvement in the mental health component of a health-related quality-of-life instrument (low strength of evidence primarily due to a single fairly small study) (Table 18). Parental stress as measured by the Parental Stress Inventory (PSI) or the PSI-Short Form (PSI-SF) did not improve with screening and treatment of depressive symptoms in a poor-quality quasi-experimental study, two fair-quality RCTs, and one good-quality RCT (low strength of evidence due to mostly poor to fair study quality and lack of precision), despite improvement in depressive symptoms with screening and treatment in all four studies. These results are consistent with a 2008 systematic review of the association between treatment of maternal depression and child outcomes, which concluded, “Based on [ten] studies, there is some evidence of associations between successful treatment of parents’ depression and improvement in children’s symptoms and functioning, but treatment of postpartum depression may not be sufficient for improving cognitive development, attachment, and temperament in infants and toddlers.”16

Table 18. Strength-of-evidence domains for benefits and harms of screening for postpartum depression.

It is important to note that the lack of improvement observed in the PSI in the studies in our review does not necessarily mean that screening and treatment for depression are ineffective in improving important aspects of the mother–infant relationship. Other possible explanations include (1) interventions that are effective in reducing depressive symptoms, when used alone, may not be sufficient to improve parenting, particularly in settings where parental stress or dysfunction is already high; (2) if sample sizes were based on change in response to a depression scale, and the PSI is not as sensitive to changes secondary to improved depressive symptoms, then the studies may have been underpowered to detect a difference in the PSI; (3) the impact of effective depression treatment on parenting takes longer to become evident than changes in depressive symptoms themselves; and (4) effective depression treatment could improve aspects of the mother–infant relationship not measured by the PSI. If part of the reason for emphasizing screening and treatment of depression in the postpartum period (compared to other points in adulthood) is to improve the mother–infant relationship, and longer term outcomes in the child, then identifying appropriate measures of this relationship, and appropriate study designs to measure them, needs to be a key research priority.

One fair-quality study found a statistically significant increase in the number of unscheduled doctor visits in the first 3 months after delivery for infants of screened women compared with unscreened women after adjusting for prescreen infant health status, but this difference was no longer significant by 12 months; it is unclear whether these visits represented inappropriate utilization. None of the other studies addressed potential harms of screening.

We did not identify any evidence that choice of screening instrument, timing of screening, setting, provider, or other factor affected the outcomes of screening.

KQ 6. Factors Affecting the Likelihood of an Appropriate Action After a Positive Screening Result

In general, rates of followup in women with positive screening test results in all of the studies included across all KQs were low, ranging from 0 to 30 percent. Differences in country, setting, population characteristics, screening instrument, and timing precluded synthesis across studies. Three studies allowed direct comparison of rates at different times during pregnancy and the postpartum period (Table 19). One study found significantly higher rates of referral when screening was performed during the delivery admission (100%) compared with 36 weeks gestation (33%) or at 6 weeks postpartum (15%; p<0.001),125 a second found a much smaller difference when comparing prenatal (33%) with postpartum (27%) screening (p=not statistically significant [NS]),123 and a third poor-quality study found higher rates of followup among postpartum women (17.9%) compared with antepartum women (0%) (p=NS).

Table 19. Strength-of-evidence domains for the effect of timing on rates of referral and treatment among women with a positive screening test for postpartum depression.

Although we did not identify any studies that directly addressed potential differences in appropriate followup based on setting or provider, there is some intriguing indirect evidence that practice characteristics may be very important. Reported followup and treatment rates among women with a positive screening test or clinical suspicion of depression were substantially higher in a study where screening, diagnosis, and treatment all occurred within an integrated primary care practice119 than were observed in other studies where positive screening results required referral for further diagnosis and treatment.

Findings in Relation to What Is Already Known

Our review focused on studies published subsequent to the 2005 AHRQ evidence review on perinatal depression.2,3 Key findings of the 2005 AHRQ review included:

  • Patient characteristics in the identified studies did not reflect the diversity of the U.S. population of pregnant and postpartum women.
  • There was a lack of precision for estimates of test characteristics, particularly for test sensitivity.
  • There were widely overlapping confidence intervals for estimates, precluding indirect comparison across tests.
  • Relatively few studies were identified that directly compared results of multiple screening instruments.
  • There was overall better sensitivity of screening instruments for the detection of major depression compared with major and minor depression combined.
  • No studies compared screening with no screening.

Recommendations in the review included:

  • Designing and powering studies to improve the precision of sensitivity estimates, if a premium is placed on negative predictive value of screening
  • Including more diverse populations in studies
  • Directly comparing different screening instruments within studies
  • Conducting studies that evaluate a broader range of timing
  • Designing studies that compare screening with no screening

Our findings in this review were broadly consistent with the 2005 results. We did identify some studies that included more diverse U.S. populations (including the development of a Spanish-language version of one of the instruments for Latina populations105); studies directly comparing different screening instruments;85,86,96 and studies comparing screening with no screening.118,120,121 However, the overall strength of the evidence base is not much better now than it was in 2005. Given the amount of time needed to design, implement, analyze, and report trials of the size necessary to address many of these concerns, it is likely that most studies that considered the recommendations of the 2005 report in their design have not yet been published.

A 2009 report for the Institute of Medicine,4 while not a formal systematic review, broadly reviewed the evidence for screening and treatment of depression in parents, including postpartum depression, and drew heavily on topic-specific systematic reviews, including the 2005 AHRQ report. The IOM report emphasized the consistent observational evidence of an association between parental depression and adverse short- and long-term outcomes in children. Specific summary conclusions regarding screening included:

  • Although there is evidence for effectiveness of screening, it is most effective when systems are in place to ensure adequate followup and treatment (similar to the USPSTF assessment).
  • There is a lack of data on the effect of screening in the primary care setting on parental function, barriers to utilization of services, or the two-generation impact of depression.
  • Although effective screening tools are available, patients are only identified as parents during the prenatal period.
  • A variety of programs have focused on screening mothers during routine pregnancy and postpartum clinical visits and other child health visits. These approaches provide opportunities to identify individuals who are at a higher risk for depression, provide education and support, assess parental function, and link child development screening with maternal depression screening (although the report reached no conclusions about effectiveness).
  • Studies have examined screening for depression in parents, particularly mothers, in existing community programs (e.g., early Head Start, those serving homeless women, substance use disorder treatment, home visitation), where individuals who are at higher risk of depression are seen. Although these settings and programs offer opportunities to reach parents and their children at greater risk for depression, screening is not routine (and, again, evidence on overall effectiveness is limited).
  • Little information is available in either public or private settings about the complex process of implementing a systematic approach to maternal or paternal depression screening and followup, including time, resources needed, workforce and training competency and capacity, and the impact of engagement and education of depressed parents on themselves as well as their children.

The findings of our review are consistent with these other reviews as well as with the USPSTF review and recommendations for screening in adults: there are reasonably consistent estimates for the sensitivity and specificity of available screening instruments, and there is evidence that screening and treatment can improve depressive symptoms; but the effectiveness of screening is dependent on the availability of systematic resources for managing patients with positive screening results, with the task force explicitly recommending screening only if such resources are available (with a “C” recommendation against screening if they are not). We identified many of the same uncertainties noted in these previous reviews, including a lack of evidence that there are no harms associated with screening (as opposed to a failure to report harms), a lack of evidence that screening and treatment for depression directly improves maternal–infant functioning, and a lack of evidence on the optimal screening interval.

Applicability

The effects of interventions as determined in research studies do not always translate well to usual practice, where patient characteristics, clinical training, diagnostic workup, and resources may differ importantly from study conditions. Thus, we qualitatively assessed the applicability of the included studies to a broader U.S. perspective.131

Many included studies recruited populations whose demographics differed considerably from patients in the broader community. Overall, only 30 percent of included studies were conducted in the United States; the largest percentage was conducted in Europe or the UK (48%). Qualitatively, results in terms of test performance, risk factors, outcomes, or receipt of appropriate services did not consistently differ between U.S.–based studies and those conducted in other countries. Event rates for postpartum depression between countries differ significantly due to dissimilarities in social and cultural contexts (e.g., family structures, gender roles). Moreover, the health care system in the United States differs considerably from those in Europe and the UK, making it problematic to translate findings to the U.S. context. In addition, given large differences between countries in educational systems, social support resources, and other factors that contribute to longer term developmental outcomes, the extent to which effective treatment of postpartum depression may influence these longer term outcomes may differ as well. Many studies had highly selected samples due to high rates of nonresponse or attrition during the studies, which limits the applicability of these findings to broader populations. The majority of studies were conducted in women in their late twenties to early thirties. Few studies included women of older maternal age. Finally, the prevalence of major depression in studies estimating the sensitivity and specificity was substantially higher than U.S. population-based point-prevalence estimates, suggesting that the positive predictive value of any screening instrument in a low-risk population will be substantially lower than the estimates derived from validation studies.

The EPDS is the most widely known and used screening tool for postpartum depression: over two thirds of studies assessed postpartum depression with the EPDS. To the extent that the EPDS is considered “standard of care,” findings from these studies would have reasonable applicability. However, these studies used a range of cutoffs to signal probable postpartum depression (range: 8 to 13), and descriptions of testing protocols were not specific enough to inform routine clinical care. As discussed elsewhere, the choice of cutpoint has significant implications for clinical outcomes, at both the individual patient level and the health system level. Confidence intervals for sensitivity estimates for all screening tests were wide, and for the most part, sensitivity and specificity estimates were qualitatively similar. In addition, some studies administered the screening test in the perinatal through discharge period in a hospital setting; the results from this setting may not be representative of the results for screening in outpatient settings. There were few direct comparisons between screening instruments, and the studies that did directly compare instruments did not identify substantial differences. There were only a few studies that directly compared screening with any instrument to no screening, and, although they suggest an improvement in depressive symptoms, there are limited data on other maternal or infant health outcomes. Lastly, there is limited information on paternal outcomes.

It is also worth noting that the single U.S.–based study that demonstrated high rates of receipt of appropriate services and significant reductions with screening119 did so within the context of family physician practices where integrated screening, diagnosis, and treatment services were available. However, the most recent available data suggest that, in the United States, family physicians account for less than 10 percent of prenatal visits (with presumably a similar proportion for postpartum visits)132 and less than 20 percent of nonacute visits for children under 4 years of age.133 If the majority of care for women or infants is being provided in settings where integration of screening with appropriate mental health diagnostic and treatment services is not available, then these results are not broadly applicable without a major change in current patterns of obstetric and pediatric care, which is unlikely in the short term.

Implications for Clinical and Policy Decisionmaking

The 2005 AHRQ report concluded that there was a lack of evidence on the overall effectiveness of screening for depression in pregnancy or the postpartum period, lack of consensus on the appropriate target for screening (major depression alone vs. major and minor depression), and, if screening is performed, uncertainty about which instrument to use. These uncertainties are reflected in the recommendations by various stakeholder organizations discussed in the Introduction. The evidence reviewed for this report does little to resolve those uncertainties: we found some evidence that screening improves some maternal outcomes compared with no screening, but the overall effect of this improvement on longer term maternal and infant outcomes is unclear.

The USPSTF gives screening for depression in adults a “B” recommendation “when staff-assisted depression care supports are in place to assure accurate diagnosis, effective treatment, and follow-up” and a “C” recommendation against routine screening “when staff-assisted depression care supports are not in place.”5 Since the current evidence suggests that the prevalence of depression in postpartum women is similar overall to that in other women of reproductive age, these recommendations should be as applicable to women during the postpartum period as at any other. Our evidence review found low rates of appropriate followup in the majority of studies, with a notable exception in a trial where screening, diagnosis, and treatment were all available within the same primary care setting,119 which is consistent with the USPSTF review.

If screening for depression during the postpartum period is especially important because of the potential impact on both mother and child, and if screening for depression is effective only when adequate resources are available to ensure appropriate followup, then the major policy implication of this report is that much greater attention needs to be paid to an explicit definition of the goals of a postpartum depression screening strategy. No matter what methods are used to ensure appropriate followup, the resources required are directly dependent on the test characteristics of the screening test, as discussed throughout this report. A small decline in specificity can result in a large absolute increase in the number of positive results, most of which will be false positives. The choice of optimal test and test thresholds, testing algorithms, and test frequency needs to be based on an explicit consideration of the tradeoff between false-positive and false-negative results.

Potential Value of Simulation Modeling

The lack of evidence for the benefits and harms of screening ultimately contributes to the difficulty in identifying the optimal screening test and strategy. There is clearly a tradeoff between false-positive and false-negative test results (Table 20). Given estimates of the point prevalence of depression of 3–7 percent in the postpartum period2,3 and the range of sensitivities and specificities of the most commonly used screening instruments, it seems likely that the number of false-positive results will exceed the number of true-positive results with the use of any single screening instrument. In the absence of direct evidence, one method for estimating the balance of benefits and harms is to use a simulation model. As described in the Methods, we adapted an existing model of pregnancy, the postpartum period, and infancy78 to generate preliminary estimates of these tradeoffs using the available evidence, including the existing uncertainty surrounding the estimates of sensitivity and specificity for currently available tests.
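The claim that false positives are likely to outnumber true positives can be checked with a short calculation. True positives equal false positives when prevalence × sensitivity = (1 − prevalence) × (1 − specificity); below that break-even prevalence, false positives dominate. The sketch below uses a hypothetical helper name and evaluates only the corners of the 80–90 percent range discussed in this report, so it is an illustration rather than a result of the review.

```python
def break_even_prevalence(sensitivity, specificity):
    """Prevalence at which expected true positives equal false positives.

    Solves p * sens = (1 - p) * (1 - spec) for p. At any prevalence below
    this value, false positives outnumber true positives.
    """
    false_positive_rate = 1 - specificity
    return false_positive_rate / (sensitivity + false_positive_rate)


# Corners of the 80-90 percent range for sensitivity and specificity.
for sensitivity in (0.80, 0.90):
    for specificity in (0.80, 0.90):
        threshold = break_even_prevalence(sensitivity, specificity)
        print(
            f"sens={sensitivity:.0%} spec={specificity:.0%} "
            f"break-even prevalence={threshold:.1%}"
        )
```

Every break-even value in this range is at or above 10 percent, so at a point prevalence of 3–7 percent a single test with these characteristics would be expected to yield more false positives than true positives.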

Table 20. Estimated annual number of true positives, false positives, and false negatives in the United States from screening with “single test” versus “serial tests”.

One strategy to reduce the number of false-positive results would be to use serial testing with a highly sensitive test first, followed by a highly specific test in patients with positive results on the first test—a strategy frequently used in other contexts (for example, use of nontreponemal tests for syphilis, followed by more specific treponemal antigen tests in positive patients134). One possible option would be to use the two-question screen, which had a reported sensitivity in two studies of 100 percent with specificities of 44 and 65 percent,91,100 followed by a second screening test in women with a positive answer to either of the two questions, as suggested by Gjerdingen et al.91

Table 20 shows the expected number of false positives and false negatives for a one-time screen with (a) one of seven screening tests alone or (b) using one of the tests only after a positive response to one of the two questions that make up the two-question screen. This analysis assumes a prevalence of postpartum depression of 5.8 percent at 2 months postpartum (the highest point prevalence estimate in the 2005 AHRQ report) and universal screening. The estimates shown are the result of 10,000 simulations using randomly selected point estimates for sensitivity and specificity from the studies reviewed for KQ 1. Serial testing has a small effect on false-negative rates but substantially decreases false-positive rates for all tests. This decrease is most dramatic for tests with lower specificity. (Confidence intervals for the estimates are not shown in Table 20, but there is considerable overlap between tests—this table should not be used to draw inferences for between-test comparisons.) As noted above, even if a false-positive result does not have any significant impact on health outcomes at the individual level, evaluating and ruling out depression in women with false-positive screening results increases the workload for existing service providers and creates the need for additional resources, which may not be readily available, particularly for providers caring for vulnerable populations where resources are already constrained.
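The effect of serial testing on test characteristics can be sketched as follows. This is a simplified illustration, not the simulation behind Table 20: it assumes the two tests are conditionally independent given true depression status (an assumption the report does not state), and the second-test values of 85 percent are hypothetical. The first-test values (sensitivity 100 percent, specificity 65 percent) are taken from the two-question screen figures quoted above, and the 5.8 percent prevalence and 4,000,000 deliveries come from the text.

```python
def serial_characteristics(first, second):
    """Combined (sensitivity, specificity) when the second test is given only
    to women positive on the first, and both must be positive to count as a
    positive screen. Assumes conditional independence of the two tests.
    """
    sens1, spec1 = first
    sens2, spec2 = second
    combined_sensitivity = sens1 * sens2
    combined_specificity = 1 - (1 - spec1) * (1 - spec2)
    return combined_sensitivity, combined_specificity


def expected_errors(population, prevalence, sensitivity, specificity):
    """Return (false_positives, false_negatives) for a one-time screen."""
    cases = population * prevalence
    false_negatives = cases * (1 - sensitivity)
    false_positives = (population - cases) * (1 - specificity)
    return false_positives, false_negatives


two_question = (1.00, 0.65)   # values quoted in the text
second_test = (0.85, 0.85)    # hypothetical second instrument

single = expected_errors(4_000_000, 0.058, *second_test)
serial = expected_errors(
    4_000_000, 0.058, *serial_characteristics(two_question, second_test)
)
print(f"single test: FP={single[0]:,.0f} FN={single[1]:,.0f}")
print(f"serial test: FP={serial[0]:,.0f} FN={serial[1]:,.0f}")
```

Under these assumptions the false-negative count is unchanged, because the first test misses no cases, while false positives fall to roughly a third of the single-test figure.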

A better understanding of the tradeoffs between harms and benefits would help to identify the optimal test and strategy. As an example, Figure 9 presents the results of a microsimulation comparing no screening, screening with the EPDS alone, screening with the Postpartum Depression Screening Scale (PDSS) alone, screening with two questions followed by the EPDS, or screening with two questions followed by the PDSS. For each simulation (n=10,000), the values for test sensitivity and specificity were randomly drawn from distributions based on the studies described in KQ 1. (Each study had an equal probability of being chosen; the specificity was drawn from a beta distribution based on the study-specific values, and the sensitivity was drawn from a function based on the selected specificity value and a log-normal distribution of the study-specific diagnostic odds ratio, in order to account for the negative correlation between sensitivity and specificity.135) Prevalence was drawn from a beta distribution based on the estimated point prevalence at 2 months in the 2005 AHRQ report. Results are shown as an “acceptability curve,” where the tradeoff between false positives (equivalent to costs in a cost-effectiveness analysis) and treated depression (the measure of effectiveness) is considered using a “willingness-to-pay” threshold: in this case, how many false positives per treated depression is a decisionmaker willing to accept? The optimal strategy is the one that has the highest net value at a given willingness-to-pay. The x-axis varies the ratio of false positives to detected cases from 0 to 10, while the y-axis depicts the proportion of simulations where a given strategy was optimal. For example, if no false positives are acceptable, then no screening is always optimal, given that none of the screening strategies has a specificity of 100 percent. As the “acceptable” ratio increases, the proportion of simulations where no screening would be preferred to any of the alternatives decreases. Values of acceptability where there is little difference between strategies indicate that the uncertainty surrounding the values of the parameters is too great to distinguish between them.
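The structure of such an acceptability analysis can be sketched as follows. This is a simplified illustration and not the model behind Figure 9: every distribution parameter below is hypothetical, the test names are placeholders, each test gets a single distribution rather than one per study, the two tests in a serial strategy are assumed to err independently, and every true positive is assumed to be treated. Net value at a willingness-to-pay ratio `lam` is taken to be `lam * treated cases - false positives`.

```python
import random

# Hypothetical parameters: beta(a, b) for specificity, and (mu, sigma) of the
# log diagnostic odds ratio (DOR). None of these values come from the report.
TESTS = {
    "test A": {"spec_ab": (85, 15), "log_dor": (3.5, 0.5)},
    "test B": {"spec_ab": (88, 12), "log_dor": (3.3, 0.5)},
}
TWO_QUESTIONS = {"spec_ab": (55, 45), "log_dor": (5.0, 0.5)}
PREVALENCE_AB = (58, 942)  # hypothetical beta centered near 5.8 percent


def draw_test(params, rng):
    """Draw specificity, then derive sensitivity from a drawn DOR.

    DOR = [sens / (1 - sens)] / [(1 - spec) / spec], so the odds of a
    positive test among cases are DOR * (1 - spec) / spec. This keeps the
    negative correlation between sensitivity and specificity.
    """
    spec = rng.betavariate(*params["spec_ab"])
    dor = rng.lognormvariate(*params["log_dor"])
    odds = dor * (1.0 - spec) / spec
    return odds / (1.0 + odds), spec


def acceptability(n_sims=2000, ratios=(0, 1, 2, 5, 10), seed=0):
    """Proportion of simulations in which each strategy has the top net value."""
    rng = random.Random(seed)
    wins = {lam: {} for lam in ratios}
    for _ in range(n_sims):
        prev = rng.betavariate(*PREVALENCE_AB)
        q_sens, q_spec = draw_test(TWO_QUESTIONS, rng)
        # Per-woman (treated cases, false positives) for each strategy.
        outcomes = {"no screening": (0.0, 0.0)}
        for name, params in TESTS.items():
            sens, spec = draw_test(params, rng)
            outcomes[name + " alone"] = (prev * sens, (1 - prev) * (1 - spec))
            outcomes["two questions then " + name] = (
                prev * q_sens * sens,
                (1 - prev) * (1 - q_spec) * (1 - spec),
            )
        for lam in ratios:
            best = max(outcomes, key=lambda k: lam * outcomes[k][0] - outcomes[k][1])
            wins[lam][best] = wins[lam].get(best, 0) + 1
    return {lam: {k: v / n_sims for k, v in w.items()} for lam, w in wins.items()}


curve = acceptability()
for lam, shares in curve.items():
    print(lam, shares)
```

Plotting the shares against `lam` for each strategy gives a curve of the kind described above. At a ratio of 0, any strategy that produces false positives has negative net value, so no screening is optimal in every simulation.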

The y-axis illustrates the proportion of simulations where a given strategy was optimal at a given acceptability threshold for the ratio of false positives/treated depression. This figure shows that serial testing is almost always favored over a single test; that there is minimal difference between the EPDS and PDSS given the available evidence; and that, even with serial testing, there is likely to be a high number of false positives associated with screening. Note that the curves for “Screen Once EPDS” and “Screen PDSS” are virtually identical and overlap.

Figure 9. Acceptability curve for tradeoff between false positives (“costs”) and treated depression (“effectiveness”) at different thresholds for false positives/treated depression ratio (“willingness-to-pay”).

Figure 9 shows the following: serial testing is almost always favored over a single test; there is minimal difference between the EPDS and PDSS given the available evidence; and, even with serial testing, there is likely to be a high number of false positives associated with screening. If additional evidence were available on the clinical harms (as well as costs) associated with a false-positive result, making a recommendation for or against screening (either screening of any type or with a specific test) would be much easier.

Consensus on the relative importance of false positives and false negatives will also help in selecting study thresholds, or in the design of new screening strategies. Many of the studies we reviewed selected a screening threshold based on the value that maximized the area under the receiver operating characteristic (ROC) curve. If a false positive and a false negative are equally bad, then choosing the threshold that optimizes both is reasonable; however, if the relative importance of the outcomes associated with each incorrect test result is different, then that difference needs to be included in the criteria for selecting the threshold. The frequency of testing, along with the natural history of the target condition, is also important—if the target condition is unlikely to worsen between screening intervals, then optimizing specificity over sensitivity might be reasonable, whereas optimizing sensitivity might be better for a one-time screen.
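One way to make the relative importance of the two kinds of error explicit when choosing a threshold is to minimize a weighted count of expected errors. The sketch below is only one possible formulation, and the candidate thresholds, their sensitivity and specificity values, and the weights are all hypothetical; only the 5.8 percent prevalence is taken from the text.

```python
def best_threshold(points, prevalence, fn_weight=1.0, fp_weight=1.0):
    """Return the threshold with the lowest weighted expected error per woman.

    `points` is a list of (threshold, sensitivity, specificity) tuples.
    The weights express how bad one false negative is relative to one
    false positive; they are judgments, not data.
    """
    def cost(point):
        _, sens, spec = point
        false_negatives = prevalence * (1.0 - sens)
        false_positives = (1.0 - prevalence) * (1.0 - spec)
        return fn_weight * false_negatives + fp_weight * false_positives

    return min(points, key=cost)[0]


# Hypothetical operating points for a single screening instrument.
ROC_POINTS = [
    (9, 0.95, 0.70),
    (11, 0.88, 0.82),
    (13, 0.78, 0.90),
    (15, 0.60, 0.96),
]

for weight in (1, 20, 50):
    chosen = best_threshold(ROC_POINTS, prevalence=0.058, fn_weight=weight)
    print("false negative weighted %dx -> threshold %d" % (weight, chosen))
```

With these made-up numbers, the preferred threshold moves toward the more sensitive end as a missed case is judged to be worse relative to a false alarm, which is the dependence the paragraph above describes.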

In the studies reviewed, followup rates for women with positive screening results were uniformly low. The impact of these low followup rates on the overall effectiveness of screening is unclear. The false-positive rate of most of the screening instruments studied is high. Therefore, if the majority of women who did not get further evaluation after screening represented women who were truly not depressed, then the overall effectiveness of screening might not be substantially worsened. On the other hand, if women with true-positive results are equally likely (or even more likely) to not follow up as women with false-positive results, screening effectiveness (and cost-effectiveness) is adversely affected. Without either better evidence about the possibility of differential followup rates or systems in place to maximize appropriate followup for screen positives, implementing screening could lead to a significant waste of resources, including both provider and patient time. This may be particularly problematic for those providing services for low-income populations, where resources for mothers and infants are already under considerable strain. Although we did not find evidence for substantial differences in screening instrument performance based on timing relative to delivery, there was some evidence for higher rates of followup when screening was performed closer to delivery (although, given the inconsistency of the results and findings related to setting, this may be related primarily to greater ease of access of referral services around the time of delivery). The risk for postpartum depression appears to continue at least through the first 12 months after delivery.2,3 The best estimate for cumulative incidence from birth to 12 months in the 2005 AHRQ report was approximately 30 percent (roughly 3% per month). 
This ongoing risk suggests that screening throughout the postpartum period might be necessary to maximize the detection of depression, particularly if doing so is necessary to optimize parenting.

However, as screening frequency increases, so does the likelihood of false-positive results for both individuals and the population; this effect has clearly been demonstrated with cancer screening models.136 Estimating the impact of different screening frequencies in a cohort of postpartum women is difficult, even with an estimate of incidence, since the point prevalence at any given time is a function of (a) incidence, (b) the duration of symptoms/condition, and (c) the proportion of symptomatic women who will be diagnosed in between screening intervals. For illustration, we can make assumptions favorable to screening, including (1) all of the new cases of depression will remain undiagnosed if screening is not performed, (2) none of the new cases will spontaneously remit in the absence of screening, (3) all women with true-positive results receive treatment, and (4) since women with false-positive results at one screening test will still be at risk for developing depression, they will be rescreened at the next scheduled time.

During each screening round, some women will have true-positive results and be removed from the cohort. At the next screening round, the total number of women with depression will be the sum of new cases among nondepressed women (true negatives and false positives in the previous round) and cases that were missed (false negatives) in the previous round. Table 21 shows the expected cumulative number of true positives, false positives, and false negatives in a cohort of 4 million women (the approximate number of deliveries in the United States annually) if screening is performed at a postpartum visit at 6 to 8 weeks, with subsequent screens during well-child visits at 3, 6, 9 and 12 months. We used the best estimates for prevalence at 6 to 8 weeks (8%), and cumulative incidence (approximately 30% at 12 months, or 3% per month) from the 2005 AHRQ report, at three different levels of sensitivity and specificity consistent with the ranges found in our review.
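The round-by-round bookkeeping described above can be sketched as a simple cohort recursion. This is an illustration under stated assumptions and is not intended to reproduce Table 21: the screening months (2, 3, 6, 9, 12), the linear approximation of new cases as 3 percent of nondepressed women per elapsed month, and the default sensitivity and specificity of 0.85 are all assumptions made for the example.

```python
def repeated_screening(cohort=4_000_000, prevalence=0.08, monthly_incidence=0.03,
                       months=(2, 3, 6, 9, 12), sens=0.85, spec=0.85):
    """Return (cumulative TP, cumulative FP test results, cases still missed).

    Follows the favorable assumptions in the text: no diagnosis outside of
    screening, no spontaneous remission, every true positive is treated and
    leaves the cohort, and false positives are rescreened next round.
    """
    depressed = cohort * prevalence   # undiagnosed cases at the first screen
    well = cohort - depressed         # nondepressed women still being screened
    cum_tp = cum_fp = 0.0
    previous_month = months[0]
    for month in months:
        # New cases arising since the previous round (hypothetical linear rule).
        new_cases = well * monthly_incidence * (month - previous_month)
        depressed += new_cases
        well -= new_cases
        true_pos = depressed * sens
        false_pos = well * (1.0 - spec)
        cum_tp += true_pos
        cum_fp += false_pos
        depressed -= true_pos         # false negatives carry into the next round
        previous_month = month
    return cum_tp, cum_fp, depressed


for s in (0.80, 0.85, 0.90):
    tp, fp, missed = repeated_screening(sens=s, spec=s)
    print("sens=spec=%.2f  TP=%.0f  FP=%.0f  missed=%.0f" % (s, tp, fp, missed))
```

Note that the false-positive total here counts test results rather than women, so a woman with two false-positive screens contributes twice.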

Table 21. Estimated number of true positives, false positives, and false negatives with screening at postpartum and well-child visits.

Even at a specificity of 90 percent, repeated testing results in a 40 percent chance of having at least one false-positive test result in the first postpartum year; at lower levels of specificity, well over half of all women would have at least one false-positive result.
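The 40 percent figure follows from simple probability if a woman who never develops depression is screened five times (the 6 to 8 week visit plus the 3, 6, 9, and 12 month visits) and the chance of a false positive at each screen is treated as independent. Both the independence assumption and the function name below are illustrative choices, not part of the original analysis.

```python
def prob_any_false_positive(specificity, n_screens):
    """Chance of at least one false positive across n independent screens."""
    return 1.0 - specificity ** n_screens


for spec in (0.90, 0.85, 0.80):
    p = prob_any_false_positive(spec, n_screens=5)
    print("specificity %.2f -> %.1f%% with at least one false positive" % (spec, 100 * p))
```

At a specificity of 0.90 this gives about 41 percent, and at 0.85 or 0.80 the probability exceeds one half, consistent with the statement above.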

Limitations of the Comparative Effectiveness Review Process

There were several limitations to our review. We limited our search to English-language articles for two main reasons: a lack of translation resources, and a priority for studies that were applicable to U.S. populations. It was the opinion of the investigators and the Technical Expert Panel (TEP) that the resources required to translate non-English articles would not be justified by the low potential likelihood of identifying relevant data unavailable from English-language sources. To the extent that studies relevant to screening for postpartum depression in the U.S. population might be published in languages other than English, we may have failed to include relevant studies.

Because there was substantial overlap between our KQs and the KQs considered in the 2005 AHRQ review, we focused our search on articles published subsequent to the last date in the search conducted for that report. The major overlap in topic between the two reports is in the test characteristics of specific screening instruments; it is possible that abstraction of some of the articles included in the 2005 report might have allowed formal synthesis of sensitivity/specificity estimates for some tests at some thresholds; however, given the heterogeneity between studies, it seems unlikely that any additional clarity about relative test performance would have been achieved. As discussed above, inclusion of studies on risk factors for postpartum depression published prior to 2004 might have led to more precise estimates of the association, assuming no temporal trends in the use of specific diagnostic criteria, although it is unlikely that these earlier studies would have provided more direct evidence that screening based on the presence of risk factors results in different clinical outcomes.

We restricted included articles on test performance and outcome to those that used a reference diagnostic interview or instrument in all positive subjects and all or a random sample of screen negatives. The low rates of followup for clinical diagnosis are also seen in research studies, which may lead to selection bias in studies that require a reference standard.92 To the extent that effective interventions are available for specific symptoms detected by a screening instrument, even if diagnostic criteria for depression are not met, this requirement may also underestimate some of the clinical benefits of screening.

Limitations of the Evidence Base

As noted above, many of the limitations of the evidence base noted in the 2005 AHRQ report2,3 and the 2009 IOM report4 are still present and include the following:

  • Patient characteristics in the applicable studies that do not reflect the diversity of the U.S. population of pregnant and postpartum women, or which are focused on high-risk populations only. Although we identified some studies conducted in more diverse populations, additional studies are needed. This is particularly important given the need to increase the precision of estimates of test characteristics and more accurately determine the potential for variations in the prevalence of depression across diverse populations.
  • Relatively few high-quality studies comparing results for multiple screening instruments, either through randomization or by administering different instruments to the same subject.
  • Relatively few high-quality studies comparing formal screening to no screening or usual care; we identified only two fair-quality randomized controlled trials (RCTs).
  • Lack of evidence for benefit associated with detecting symptoms of depression that together do not meet criteria for a diagnosis of major depression. Such evidence would be extremely helpful in setting thresholds for a positive test, as well as helping define the overall benefits of screening.
  • Lack of evidence for harms associated with screening (a lack that was also noted in the USPSTF review of depression screening in the general adult population5). Potential harms of a false-positive result at the individual level (or of a true-positive result when effective treatment is not available) include stigmatization and anxiety. Other than one study that reported a short-term increase in the number of unscheduled doctor visits in infants of screened women (and where there was ambiguity about whether these visits were appropriate or not), we did not identify any studies that reported on outcomes for all women with positive results rather than limiting the reporting to only those women with a confirmatory diagnostic evaluation.
  • Lack of evidence for an impact of screening and treatment of depression on longer term maternal and infant outcomes. This is ultimately needed to help in the weighing of harms versus benefits when deciding if, when, and whom to screen for postpartum depression. Although the consistent association between postpartum depression and a variety of adverse outcomes in infants and children is often cited as one of the primary rationales for screening, there is little or no direct evidence that screening and treatment leads to improved outcomes compared to no screening. Three studies of different design and different setting found no significant improvement in the PSI, a commonly used measure of parental stress, among women screened and treated for postpartum depression, despite improvement in depressive symptoms. Whether this lack of change is related to different levels of effectiveness of the interventions studied for depression and parenting, responsiveness of the specific measure used, or aspects of study design such as sample size, these results suggest that detection and treatment of depression alone may not be sufficient to lead to improved child outcomes. Given that many of the social, relationship, and personality factors consistently associated with postpartum depression are also likely to be associated with suboptimal development outcomes in children, some evidence that, for example, treating depression in a single mother in a poor-quality relationship will lead to improved outcomes in children, even if the social factors do not change, would be helpful to strengthen the case for screening.
  • Finally, one of the biggest barriers to synthesizing this literature is the diversity in research methods, definitions, and analytic tools used. Given the interdisciplinary nature of the condition, this diversity can be extremely helpful in bringing fresh insights to the problem. However, because of differences in preferred methods between fields, synthesis of results can be challenging. Even when the same technique is used, the results may be reported differently. For example, even though logistic regression is commonly used across a wide range of research as a method for multivariable analysis, different fields report the results differently. Medical and epidemiologic studies will report odds ratios and confidence intervals, while some studies we reviewed in the psychological literature reported pseudo-R2 values, or other summary statistics. This barrier was also specifically cited by the IOM in its review of depression in parents.4

Research Gaps

General Gaps

Understanding the potential benefits and harms of screening for postpartum depression is an issue of considerable interest to patients, clinicians, and policymakers. Section 2952 of the 2010 Patient Protection and Affordable Care Act provides for funding for research related to postpartum depression,137 and there are two current funding opportunities from NIH specifically targeting mental health during pregnancy and the postpartum period.138,139 This review has identified a number of research gaps that could be addressed utilizing these resources.

As noted above, one of the major limitations of the current evidence base is the wide disparity in methods and definitions used in studies relevant to screening for postpartum depression. This disparity limits the ability to synthesize the existing literature across disciplines; in particular, it significantly limits the ability to perform meta-analyses. It would be extremely valuable for researchers in the field to reach consensus on a core set of measures that would be reported consistently across all relevant studies. For studies of interventions, common outcomes measures are the highest priority. For observational studies, or other study designs where there is a need to adjust for potential confounding, common measures for both outcomes and confounders are needed. In practice, this means not only agreement on which variables to collect, but how to measure and report them. For example, parity is frequently reported as a mean and standard deviation, which not only is clinically meaningless (since noninteger values of number of deliveries have no interpretation) but also does not reflect the underlying distribution.

For many of the recommendations below, use of formal simulation and decision models may prove useful. As described above, even a simple model can be helpful in illustrating tradeoffs and can highlight the relationship between uncertainty about the relative likelihood of adverse outcomes compared to favorable outcomes, the acceptable harm/benefit tradeoff, and the extent to which further research will help clarify the optimal decision or recommendation. This approach can be applied using specific clinical outcomes alone, or it can explicitly incorporate costs; in the latter case, this value-of-information analysis can help inform research prioritization and research budgeting.80,140 Further development of the model outlined in this report could incorporate variations in strategies, such as timing of screening relative to delivery, repeated screening at varying intervals during pregnancy and the postpartum period, use of strategies to target high-risk groups for screening, and strategies to enhance followup and treatment of women with positive screening results.

KQ 1

  • Although greater precision for sensitivity estimates would be useful, there will always be greater uncertainty about sensitivity than specificity in a screening setting, since the number of subjects with the underlying condition will always be much smaller than the number of subjects without the condition. Given this limitation, it would ultimately be more efficient to perform studies large enough to address the question directly rather than multiple additional smaller studies, particularly if the smaller studies focus on a single instrument. We would suggest the following:
    1. Achieving consensus on the appropriate tradeoff between false positives and false negatives and using thresholds defined by these clinical criteria to determine optimal sensitivity and specificity for candidate screening instruments. As discussed above, even fairly small differences in test characteristics can translate into large differences in the likelihood of an accurate test result, with significant implications for both the individual patient and the larger health care system.
    2. Determining other criteria for evaluating screening instruments (ease of administration, time associated with administration, costs, patient and provider acceptability, etc.). These criteria could be collected as part of the study. Alternatively, patient and provider acceptability could be measured using methods such as discrete choice experiments to assess the relative importance of different attributes of the screening test;141 these data could then be used to inform the choice of which instruments to evaluate further.
    3. Defining sample size for the study based on detecting clinically relevant differences in test performance and acceptability, with these differences being at least partially derived empirically in the first two steps.
    4. Directly comparing candidate instruments, either by having the same subject use each instrument (randomized as to order of administration) or by randomizing different subjects to different instruments. The tradeoff here is between the increased generalizability of having subjects take a single test versus overall sample size.
    5. These considerations should include an explicit discussion of screening frequency during the postpartum period, since this has significant implications for both the cumulative probability of a false-positive result as well as for the setting where screening is most likely to occur.
  • The question of whether different instruments are better at identifying specific signs and symptoms is only important if there are effective interventions for those specific signs and symptoms. Clarity is needed on which signs and symptoms, and what potential interventions are available, in order to discuss potential research designs. One first step might be a systematic review focused on the individual signs and symptoms identified in the different screening instruments, with an emphasis on identifying effective interventions.
  • If a large part of the goal of screening for depression is to improve longer term child outcome through improved functioning of the mother–infant dyad, then consideration should be given to characterizing the sensitivity and specificity of screening tests or algorithms, both existing ones and new ones, based on their ability to predict or detect maladaptive functioning or longer term adverse outcomes.
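The first point in the list above, that sensitivity will always be estimated less precisely than specificity in a screening study, can be illustrated with the standard error of a binomial proportion. The study size of 1,000 women and the true sensitivity and specificity of 0.85 are hypothetical; the 5.8 percent prevalence is the point estimate used earlier in this section.

```python
import math


def standard_errors(n, prevalence, sens, spec):
    """Approximate standard errors of estimated sensitivity and specificity.

    Sensitivity is estimated only from women with the condition and
    specificity only from women without it, so the denominators differ.
    """
    cases = n * prevalence
    noncases = n - cases
    se_sens = math.sqrt(sens * (1.0 - sens) / cases)
    se_spec = math.sqrt(spec * (1.0 - spec) / noncases)
    return se_sens, se_spec


se_sens, se_spec = standard_errors(n=1000, prevalence=0.058, sens=0.85, spec=0.85)
print("SE of sensitivity: %.3f" % se_sens)
print("SE of specificity: %.3f" % se_spec)
print("ratio: %.1f" % (se_sens / se_spec))
```

With equal true values, the ratio of the two standard errors depends only on the ratio of noncases to cases, so at this prevalence the uncertainty about sensitivity is roughly four times that about specificity regardless of total sample size.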

KQ 2

  • Although we identified a number of consistent risk factors for postpartum depression, we did not identify any articles that used a multivariate predictive model to stratify patients by risk of developing the condition in order to screen more efficiently (similar to the Gail model, which is used to identify women at higher risk of breast cancer for more aggressive screening protocols). The potential impact of such a model could be estimated based on the absolute risk of postpartum depression at different thresholds and then using this information to estimate the number of false positives and false negatives resulting from screening only women identified as high risk. This could be compared to the estimated number of unwanted screening outcomes resulting from other strategies designed to minimize false positives, such as serial testing, using a simulation model. These data could, in turn, be used to estimate the size, costs, and value-of-information of a comparative trial.

KQs 3–6

  • There was insufficient direct evidence to address the effect of timing, setting, or provider on test characteristics. It seems plausible that differences in clinical outcome relevant to timing, setting, or provider are more directly related to aspects of the process of screening, referral, and diagnosis rather than to differences in the test characteristics of the specific screening instrument used in the study. In other words, studies that compare the effects of timing, setting, or provider on overall clinical outcomes should be a higher priority for research resources than studies that only compare sensitivity and specificity of screening instruments by timing, setting, or provider.
  • Additional RCTs comparing organized screening with usual care are needed. Ideally, some of these studies could address issues relevant to differences in timing, setting, or provider, perhaps through factorial designs.
  • Explicit definitions of harms and benefits are needed and would necessarily be part of any formal discussion on appropriate targets for sensitivity and specificity.
  • Parental stress should be included in studies of screening and treatment of maternal depression. Furthermore, the relationship between stress, depression, and other important outcomes should be carefully explored.
  • The use of a two-question screen followed by a standardized screening instrument in women who answer yes to one of the questions would appear to have substantial potential to improve screening efficiency based on reported test characteristics and a simple model; future screening studies in the United States should strongly consider including this approach as one of the study arms.
  • Ideally, these studies should include a long-term followup component for both mothers and infants. Although this will substantially affect costs and timing of the studies, if the ultimate rationale for screening involves both maternal and child outcomes, then a more explicit demonstration of the benefits in terms of these longer term outcomes is needed.
  • If longer term studies are not feasible, and the rationale for screening during the postpartum period is strengthened by the potential to improve longer term outcomes through improving the maternal–infant relationship, then studies should incorporate valid and sensitive measures of this relationship that are reliable surrogates for longer term outcomes. To the extent that scores on measures of depression may be more sensitive to depression treatment than scores on measures of parental function, consideration should be given to designing and powering studies to detect clinically meaningful differences in parental functioning as the primary outcome. A depression screening and intervention study powered to detect a difference in a parental functioning outcome would be likely to have sufficient power to detect improvement in depression symptoms, whereas the converse may not be the case.
  • There was low-strength evidence that timing might affect likelihood of receiving appropriate diagnostic and therapeutic services, and reported receipt of appropriate diagnostic and therapeutic services was much higher in two studies where screening, diagnosis, and treatment were available from the same provider.