How the guidelines were developed

Publication Details

The guidelines were developed according to the WHO Handbook for Guideline Development. For a detailed discussion of the merits and challenges of applying the WHO guideline development process to mental, neurological and substance use disorders, see Barbui et al. (2010).

The scope

The WHO Secretariat initially proposed scoping questions. These questions included interventions for PTSD, bereavement and a range of symptoms that can occur in the first month after a potentially traumatic event. The questions did not focus on the acute stress reaction concept, as it is anticipated that this will no longer be classified as a mental disorder in ICD-11. Similarly, the questions did not focus on the acute stress disorder concept (which is not in the ICD), as this concept has poor predictive validity and its utility may depend on the need to have a diagnosis for making health insurance payments (Bryant et al., 2011). After three rounds of electronic consultation with the GDG, it was agreed that the guidelines should cover the management of the following problems in adults and children:

  • symptoms of acute stress in the first month after a potentially traumatic event, with the following subtypes:
    – symptoms of acute traumatic stress (intrusion, avoidance and hyperarousal) in the first month after a potentially traumatic event;
    – symptoms of dissociative (conversion) disorders in the first month after a potentially traumatic event;
    – non-organic (secondary) enuresis in the first month after a potentially traumatic event (in children);
    – hyperventilation in the first month after a potentially traumatic event;
    – insomnia in the first month after a potentially traumatic event;
  • posttraumatic stress disorder (PTSD);
  • bereavement in the absence of mental disorder.

Further consultations with the GDG involved review of scoping questions phrased in the PICO (Population, Intervention, Comparison, Outcomes) format. Outcomes were listed, and the GDG voted to rank them by importance using three levels (critical, important, not important). Individual ratings were converted to a 9-point scale (critical = 8; important = 5; not important = 2), averaged, and rounded to obtain an average level of importance on the 9-point scale used in GRADE methodology.2
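The conversion described above can be sketched as follows. This is an illustrative sketch only, not the actual WHO spreadsheet logic; the function name and the example votes are invented for illustration.

```python
# Sketch of the outcome-importance conversion: categorical votes are
# mapped to the 9-point scale, averaged, and rounded.
LEVEL_SCORES = {"critical": 8, "important": 5, "not important": 2}

def average_importance(votes):
    """Convert categorical votes to the 9-point scale and average them."""
    scores = [LEVEL_SCORES[v] for v in votes]
    mean = sum(scores) / len(scores)
    return int(mean + 0.5)  # round half up to the nearest whole point
```

For example, three members voting critical, critical and important would give (8 + 8 + 5) / 3 ≈ 7.0, i.e. an average importance of 7 on the 9-point scale.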

The process of drafting, reviewing and revising PICO questions occurred during June–July 2011, while the priority ranking of outcomes occurred in November 2011. All GDG members rated the outcomes.

Evidence search and retrieval

By the end of July 2011 a set of scoping questions had been finalized. These were then used to guide searches for relevant systematic reviews that had been performed within the last two years and met inclusion criteria (see evidence profiles 1–21 in Annex 5 for specific inclusion and exclusion criteria).

While only systematic reviews less than two years old at the time of the search were included, there were no age limitations on individual studies within those reviews. Although the same interventions were considered for adults and children, separate reviews were done. Evidence from the adult literature was not generalized to children.

Where relevant systematic reviews did not exist, were not recent (had not been done within the last two years) or were not of suitable quality or applicability, new systematic reviews were commissioned. Databases searched included MEDLINE, Medline In-Process, Embase, HMIC, PsycINFO, ASSIA and CINAHL, using the Ovid interface. For the commissioned systematic review on medicines for PTSD, specific additional searches were carried out to identify international studies in Japanese, Chinese, French, Portuguese, Russian and Spanish (see the evidence profiles in Annex 5 for more detail on search terms and databases searched).

Evidence to recommendations

The WHO Handbook for Guideline Development was followed, keeping in mind both the strengths (transparency) and the limitations (low inter-rater reliability) of GRADE. The GRADE system, created to enable explicit assessment of the quality of evidence and the use of evidence in developing recommendations, was applied. When assessing the evidence base, methodologists (consultants supporting the GDG) summarized the evidence extracted from systematic reviews and meta-analyses into “Summary of findings” tables and graded the quality of the evidence summarized in those tables (see Annex 5).

In the GRADE system, the “Quality of the evidence” is defined as the level of confidence that the estimate of the effect of an intervention is correct. The quality of evidence is rated as high, moderate, low or very low quality, as detailed in the table below.

During grading, evidence from randomized controlled trials (RCTs) starts as high quality, while evidence from observational study designs (e.g. non-randomized or quasi-randomized intervention studies, cohort studies, case-control studies and other correlational designs) starts as low quality. The quality of the evidence is then assessed further. Five criteria can be used to downgrade the evidence:

  • Risk of bias: limitations in the study design that may bias the overall estimates of the treatment effect;
  • Inconsistency: unexplained differing estimates of the treatment effect (i.e. heterogeneity or variability in results) across studies;
  • Indirectness: the question being addressed by the guideline panel is different from the available evidence regarding the population, intervention, comparator or outcome;
  • Imprecision: results are imprecise when studies include relatively few patients and few events and thus have wide confidence intervals around the estimate of the effect;
  • Publication bias: systematic underestimate or overestimate of the underlying beneficial or harmful effect due to the selective publication (or reporting) of studies.

Three other criteria may be used to upgrade the quality of evidence rating: a strong association (large magnitude of effect), a dose–response gradient, and plausible confounding that would, if anything, reduce a demonstrated effect (so the true effect is likely at least as large as observed).
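The grading procedure above can be summarized as a small sketch: evidence starts at high (RCTs) or low (observational designs), is moved down one level per downgrade criterion applied and up one level per upgrade criterion, and is clamped to the four GRADE levels. This is a simplified illustration, not WHO code; the function and parameter names are assumptions.

```python
# The four GRADE quality levels, from worst to best.
LEVELS = ["very low", "low", "moderate", "high"]

def grade_quality(randomized, downgrades=0, upgrades=0):
    """Start at high (RCTs) or low (observational), then adjust and clamp."""
    start = LEVELS.index("high") if randomized else LEVELS.index("low")
    idx = start - downgrades + upgrades
    idx = max(0, min(idx, len(LEVELS) - 1))  # stay within the four levels
    return LEVELS[idx]
```

For instance, RCT evidence downgraded once (say, for imprecision) would be rated moderate, while observational evidence upgraded once (say, for a strong association) would also be rated moderate.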

During the guideline development meeting in Amman, Jordan, the GDG was provided with evidence profiles summarizing the evidence retrieved, including evidence on values, preferences, benefits, harms and feasibility for 21 questions on specific interventions. Wherever possible, the evidence retrieved was graded and GRADE tables were provided. A decision table was used by the GDG during the meeting to agree on the quality of evidence and certainty about harms and benefits, values and preferences, feasibility and resource implications (see Annex 5 for details of each question, evidence search, inclusion and exclusion criteria, and decision tables). In several instances the group decided that the lack of randomized evidence on the effect of proposed interventions, coupled with uncertainty about harms and benefits, values and preferences, feasibility and resource implications, meant that no recommendation could be made at this time. This has been indicated in the list of recommendations.

The strength of the recommendation was set as either:

“Strong”: meaning that the GDG members agreed that the quality of the evidence combined with certainty about the values, preferences, benefits and feasibility of this recommendation meant that it should be followed in all or almost all circumstances;

“Standard”: meaning that there was less certainty about the combined quality of evidence and values, preferences, benefits and feasibility of this recommendation. Hence there may be circumstances in which it will not apply. The word “standard” (rather than “weak” or “conditional”) was chosen to be in line with earlier WHO mhGAP guidelines and also to avoid the negative connotations of the word “weak”, which could have risked biasing GDG members towards strong recommendations.

On the basis of summary text in the evidence profiles on quality of evidence, benefits versus harms, values and preferences (from an end-user perspective) and resource consumption (from a health services perspective), the following decision table was completed by the GDG to come to a decision on a strong versus a standard recommendation.

On a number of occasions, the GDG decided to give a strong recommendation despite a GRADE assessment of the available evidence on effect as being of “very low quality”. This occurred only when the following conditions applied: (a) there was certainty about the balance of benefits versus harms and burdens; (b) the expected values and preferences were clearly in favour of the recommendation; and (c) there was certainty about the balance between benefits and resources being consumed.

Occasionally it was not necessary to complete the table in full, because a partially completed table already showed that the recommendation would have to be standard (e.g. if GDG members agreed that the answer was “No” to two questions, there was no need to ask the remaining two questions: the recommendation would have to be standard). This saved time during the meeting, given that discussion of the questions for each decision table was often lengthy.
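The early-stop rule above can be sketched as follows. This is a hypothetical illustration, not the GDG's actual procedure: the source states only that two “No” answers force a standard recommendation, so the assumption that a strong recommendation requires “Yes” to all four questions is labelled as such below.

```python
def assess(answers):
    """Walk the decision-table questions in order; answers are booleans
    (True = "Yes"). Stops early once two "No" answers force "standard".
    Returns (strength, number_of_questions_asked)."""
    no_count = 0
    for asked, ans in enumerate(answers, start=1):
        if not ans:
            no_count += 1
        if no_count >= 2:  # early stop: strength can only be standard
            return "standard", asked
    if no_count == 0 and len(answers) == 4:
        # Assumption for illustration: "Yes" to all four questions -> strong.
        return "strong", len(answers)
    return "undetermined (GDG judgement)", len(answers)
```

For example, answers of No, Yes, No would stop after the third question with a standard recommendation, sparing discussion of the fourth.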

Group process

During the GDG meeting in Amman, decisions were usually made by consensus, but where there was disagreement a vote was taken and a two-thirds majority was required for a decision to be carried. After any vote, GDG members who were in the minority were asked whether they wanted to reconsider their position. In all cases, this led to at least a two-thirds majority.



2 The spreadsheet files used to collect and analyse these data are available upon request. We believe this approach may be a slight improvement on the traditional GRADE method of asking people to rate the importance of outcomes directly. A possible problem with direct rating is that respondents who consider an outcome important may have difficulty assigning it a 4, 5 or 6 if they have been conditioned (through their school grading system) to regard those values as low and thus not corresponding to an “important” rating. Asking participants to rate outcomes directly on a 9-point scale may therefore bias ratings upwards. This problem may be overcome by asking people to rate using only three levels (critical, important, not important) and then converting these to the 9-point scale.