NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Warren Z, Veenstra-VanderWeele J, Stone W, et al. Therapies for Children With Autism Spectrum Disorders. Rockville (MD): Agency for Healthcare Research and Quality (US); 2011 Apr. (Comparative Effectiveness Reviews, No. 26.)



This chapter documents procedures that we used to develop this comparative effectiveness review on the treatment of autism spectrum disorders (ASDs) in children ages 2–12. We first describe our strategy for identifying articles relevant to our key questions, our inclusion/exclusion criteria, and the processes used to abstract relevant information from eligible articles and generate the evidence table. We also discuss our criteria for grading the quality of individual articles and for rating the strength of the evidence as a whole.

Literature Review Methods

Inclusion and Exclusion Criteria

Our inclusion/exclusion criteria were developed in consultation with the Technical Expert Panel (TEP). Criteria are summarized below (Table 6).

Table 6. Inclusion and exclusion criteria.


For this review, the relevant population for key questions (KQ) one through six was children with ASDs (autism, Pervasive Developmental Disorder-Not Otherwise Specified (PDD-NOS), Asperger syndrome) whose mean age plus standard deviation was ≤ 12 years and 11 months. Studies needed to provide adequate information to ensure that participants fell within the target age range. Specifically, we limited the age range to 2–12 because (a) diagnosis of ASDs earlier than age 2 is less established and (b) adolescents likely face substantially different challenges and would warrant different interventions than children in the preschool, elementary, and middle school age groups. We did, however, add one question (KQ7) focusing on children under age 2; children in this age group are not definitively diagnosable but may be at risk, either because they have a sibling with an ASD or because they exhibit signs suggestive of a possible ASD diagnosis.

We excluded studies of behavioral, educational, allied health, or complementary and alternative medicine (CAM) interventions with fewer than 10 total participants, and studies of medical interventions with fewer than 30 total participants. We selected these criteria in consultation with our content experts as a minimum threshold for comparing interventions. We believed that, given the greater risk associated with the use of medical interventions, it was appropriate to require a larger sample size to accrue adequate data on safety and tolerability, in addition to efficacy. We restricted the review to medical studies with at least 30 participants given that most studies of medical interventions for ASDs with fewer than 30 subjects report preliminary results that are superseded by later, larger studies. This restriction did not eliminate specific medical therapies from the review, as treatments are typically assessed in larger studies following their preliminary investigation. Moreover, these sample size constraints are not uncommon in the systematic review/comparative effectiveness review literature.
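
These thresholds amount to a simple screening rule. A minimal sketch in Python (the function and category names are ours, for illustration only, not part of the review protocol):

```python
def meets_sample_size_threshold(intervention_category: str, n_participants: int) -> bool:
    """Apply the review's minimum-sample-size inclusion rule.

    Studies of medical interventions required at least 30 total participants
    (to accrue adequate safety and tolerability data); behavioral, educational,
    allied health, and CAM studies required at least 10.
    """
    minimum = 30 if intervention_category == "medical" else 10
    return n_participants >= minimum

# A 12-child behavioral study is included; a 25-subject drug trial is not.
assert meets_sample_size_threshold("behavioral", 12) is True
assert meets_sample_size_threshold("medical", 25) is False
```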

We accepted any study designs except individual case reports, and our approach to categorizing study designs is presented in Appendix F. Our interest was in identifying the effectiveness of interventions that target core and commonly associated symptoms of ASDs, compared with other interventions or no intervention.

We note that if a research study used a comparison group that did not contribute to an estimate of the contrast of interest in our review, we included the one arm of the study that was relevant. For example, an intervention study in which the intervention group is children with ASDs and the comparison group is a group of children with Down Syndrome would not provide an estimate of the effect of the intervention for children with ASDs. Rather than exclude this study, we include the group of children with ASDs as a case series.

We recognize that setting a minimum of 10 participants for studies to be included effectively excluded much of the literature on behavioral interventions using single-subject designs. Because there is no separate comparison group in these studies they would be considered case reports (if only one child included) or case series (multiple children) under the rubric of the EPC study designs. Case reports and case series can have rigorous evaluation of pre- and post- measures, as well as strong characterization of the study participants, and case series that included at least 10 children were included in the review.

Single-subject design studies can be helpful in assessing response to treatment in very short timeframes and under very tightly controlled circumstances, but they typically do not provide information on longer term or functional outcomes, nor are they ideal for external validity without multiple replications.97 They are useful in serving as demonstration projects, yielding initial evidence that an intervention merits further study, and, in the clinical environment, they can be useful in identifying whether a particular approach to treatment is likely to be helpful for a specific child. Our goal was to identify and review the best evidence for assessing the efficacy and effectiveness of therapies for children with ASD, with an eye toward utility in the treatment setting. With the assistance of our technical experts, we selected a minimum sample size of 10 in order to maximize our ability to describe the state of the current literature, while balancing the need to identify studies that could be used to assess treatment effectiveness.

As the team lacked translators for potentially relevant non-English studies, we also excluded studies that were not published in English. In addition, we excluded studies that:

  • Did not report information pertinent to the key questions
  • Were published prior to the year 2000 (the revision of the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) and widespread implementation of gold standard assessment tools including the Autism Diagnostic Observation Schedule (ADOS) and the Autism Diagnostic Interview–Revised [ADI-R])
  • Were not original research
  • Did not present aggregated results (i.e., included data for individual participants only) or presented graphical data only.

Literature Search and Retrieval Process

Databases. We employed search strategies provided in Appendix A to retrieve research on the treatment of autism spectrum disorders, including Asperger syndrome and Pervasive Developmental Disorder, Not-Otherwise-Specified. Our primary literature search employed three databases: MEDLINE® via the PubMed interface, PsycINFO (psychology and psychiatry literature), and the Education Resources Information Center (ERIC), searched from 1980 to the present. We also hand-searched the reference lists of all included articles to identify additional studies for review.

Grey literature. The AHRQ Scientific Resource Center also searched for information on the two medications specifically approved for treating irritability in ASDs (risperidone and aripiprazole) in resources including the websites of the US Food and Drug Administration and Health Canada, as well as clinical trials registries. We gave manufacturers of these medications, as well as of hyperbaric oxygen chambers, an opportunity to provide additional information.

Search terms. Controlled vocabulary terms served as the foundation of our search in each database, complemented by additional keyword phrases to represent ASDs in the clinical and educational literature. We also employed indexing terms when possible within each of the databases to exclude undesired publication types (e.g., reviews, case reports, news), items from non-peer-reviewed journals, and items published in languages other than English.

Our searches were executed between May 2009 and May 2010. Appendix A provides our search terms and the yield from each database.

Article selection process. Once we identified articles through the electronic database searches, review articles, and bibliographies, we examined abstracts of articles to determine whether studies met our criteria, including the cutoff date of the year 2000. Two reviewers separately evaluated each abstract for inclusion or exclusion, using an Abstract Review Form (Appendix B). If one reviewer concluded that the article could be eligible for the review based on the abstract, we retained it. The review group included three expert clinicians (WS, ZW, JV) and two senior health services researchers (MM, RJ). Two reviewers assessed the full text of each included article using a standardized form (Appendix B); disagreements between reviewers were resolved by a third-party adjudicator.

Categorization of Interventions

As has been previously noted, ASD intervention categories overlap substantially, and it is difficult to cleanly identify the category into which an intervention should be placed.14 We considered multiple approaches for organizing the results, and note that no alternative approaches would have changed our overall findings either in terms of outcomes or strength of evidence for any category of intervention.

Behavioral interventions. We defined behavioral interventions to include early intensive behavioral and developmental interventions, social skills interventions, play/interaction-focused interventions, interventions targeting symptoms commonly associated with ASDs such as anxiety, and other general behavioral approaches.

Early intensive behavioral and developmental interventions. We adopted a similar approach to the operationalization of the early intensive behavioral and developmental intervention category as Rogers and Vismara12 in their review of “comprehensive” evidence-based treatments for early ASDs. Interventions in this category all have their basis in or draw from principles of applied behavior analysis (ABA), with differences in methods and setting. ABA is an umbrella term describing principles and techniques used in the assessment, treatment and prevention of challenging behaviors and the promotion of new desired behaviors. The goal of ABA is to teach new skills, promote generalization of these skills, and reduce challenging behaviors with systematic reinforcement. The principles and techniques of ABA existed for decades prior to specific application and study within ASDs.

We include in this category two intensive manualized (i.e., have published treatment manuals to facilitate replication) interventions: the University of California, Los Angeles (UCLA)/Lovaas model and the Early Start Denver Model (ESDM). These two interventions have several key differences in their theoretical frameworks and implementation, although they share substantial similarity in the frequent use of high intensity (many hours per week, one-on-one) instruction utilizing ABA techniques. They are described together here because of these similarities. We note, however, that the UCLA/Lovaas method relies heavily on one-on-one therapy sessions during which a trained therapist uses discrete trial teaching with a child to practice target skills, while ESDM blends ABA principles with developmental and relationship-based approaches for young children.

The other treatment approaches in this category also incorporate ABA principles, and may be intensive in nature, but often have not been manualized. We have classified these approaches broadly as UCLA/Lovaas-based given their similarity in approach to the Lovaas model. A third set of interventions in this category comprises those using principles of ABA to focus on key pivotal behaviors rather than global improvements. These approaches emphasize parent training as a modality for treatment delivery (e.g., Pivotal Response Training, Hanen More than Words, social pragmatic intervention) and may focus on specific behaviors such as initiating or organizing activity or on core social communication skills. Because they emphasize early training of parents of young children, they are reviewed in this category.

Social skills interventions. Social skills interventions focus on facilitating social interactions and may include peer training and social stories.

Play/interaction-focused interventions. These approaches use interactions between children and parents or researchers to affect outcomes such as imitation or joint attention skills or the ability of the child to engage in symbolic play.

Interventions focused on behaviors commonly associated with ASDs. These approaches attempt to ameliorate symptoms such as anger or anxiety, often present in ASDs, using techniques such as Cognitive Behavioral Therapy (CBT) and parent training focused on challenging behaviors.

Additional behavioral interventions. We categorized approaches not cleanly fitting into the behavioral categories above in this group, which includes interventions such as sleep workshops and neurofeedback.

Educational interventions. Educational interventions are those focusing on improving educational and cognitive skills and intended primarily to be administered in educational settings, or studies for which the educational arm was most clearly categorized. These interventions include programs such as the Treatment and Education of Autistic and related Communication-handicapped CHildren (TEACCH) model and other treatments implemented primarily in the educational setting. Some of the interventions implemented in educational settings are based on principles of ABA and may be intensive in nature, but none of the educational interventions described in this report used the UCLA/Lovaas or ESDM manualized treatments.

Medical and related interventions. We broadly defined medical and related interventions as those that included the administration of external substances to the body in order to treat symptoms of ASDs; medical interventions represented in the literature included in this review comprised prescription medications, supplements and enzymes, diet therapies, and treatments such as hyperbaric oxygen.

Allied health interventions. Allied health interventions included therapies typically provided by occupational and physical therapists, including auditory and sensory integration, music therapy and language therapies.

Complementary and alternative medicine (CAM) interventions. Approaches in this category addressed in this review include acupuncture and massage.

Literature Synthesis

Development of Evidence Table and Data Abstraction Process

The staff members and clinical experts who conducted this review jointly developed the evidence table, which was used to abstract data from the studies. We designed the table to provide sufficient information to enable readers to understand the studies, including issues of study design, descriptions of the study populations (for applicability), description of the intervention, appropriateness of comparison groups, and baseline and outcome data on constructs of interest. We also abstracted data about harms or adverse effects of therapies, defined by the EPC program as the totality of all possible adverse consequences of an intervention.98

The team abstracted several articles into the evidence table and then reconvened as a group to discuss the utility of the table design. We repeated this process through several iterations until we decided that the table included the appropriate categories for gathering the information contained in the articles. All team members shared the task of initially entering information into the evidence table. Another member of the team also reviewed the articles and edited all initial table entries for accuracy, completeness, and consistency. The full research team met regularly during the article abstraction period and discussed global issues related to the data abstraction process. In addition to outcomes related to treatment effectiveness, we abstracted all data available on harms. Harms encompass the full range of specific negative effects, including the narrower definition of adverse events.

The final evidence table is presented in its entirety in Appendix C. Studies are presented in the evidence table chronologically and alphabetically by the last name of the first author within each year. When possible to identify, analyses resulting from the same study were grouped into a single entry. A list of abbreviations and acronyms used in the table appears at the end of this report.

Several reporting conventions for describing studies in the evidence table were adopted that warrant explanation, namely those related to practice setting, intervention setting, and assessments. We developed a brief taxonomy of the most common practice settings to reflect the entity that conducted the research. Practice settings include:

  • Academic (comprises academic medical centers and universities)
  • Community
  • Specialty treatment centers
  • Residential centers
  • Private practice
  • Other (including pharmaceutical companies).

We developed a similar listing for intervention settings to reflect where the intervention was implemented, including home, school, clinic, and residential center. We considered the default setting for drug studies to be the clinic (even if medication was provided by caregivers in the home). Behavioral interventions involving the clinician in both the home and clinic were coded as occurring in both settings.

We captured data on the conduct of assessments in order to inform the evaluation of quality of study conduct and to address questions of applicability of the intervention outcomes data to different populations of children with ASDs; data reported include the assessment conducted (e.g., ADOS), the context and administrator of the assessment (e.g., administered by study psychologist in the clinic), and the timing (pre-intervention and at the six and eight week study visit, etc.).

Assessing Methodological Quality of Individual Studies

We used a components approach to assessing the quality of individual studies, following methods outlined in the EPC Methods Guide for Effectiveness and Comparative Effectiveness Reviews.99 The individual quality components are described here. Individual quality assessments for each study are reported in Appendix H.

In some instances, it was appropriate to apply specific questions only to one body of literature (e.g., to medical literature) and we note those cases where appropriate. Each domain described below was assessed individually and combined for an overall quality level using the algorithm below. Three levels were possible: good, fair, and poor.

Study design. Ideally, studies should use a comparison group in order to make causal inferences. The comparison group should accurately represent the characteristics of the intervention group in the absence of the intervention. Specifically, factors that are likely to be associated with the intervention selected and with outcomes observed should be evenly distributed between groups, if possible. These factors may include, for example, age, intelligence quotient (IQ), or ASD severity. Four questions were used to assess the study design:

  1. Did the study employ a group design (have a comparison group)?
  2. Were the groups randomly assigned?
  3. If no, was there an appropriate comparison group?
  4. If yes, was randomization done correctly?

We considered the following elements in determining the appropriateness of a study’s randomization methods: Were truly random techniques, such as computer-generated sequences with sequentially numbered opaque envelopes, used? Were technically nonrandom techniques, such as alternating days of the week, used? Was the similarity between groups documented?

Scoring: Studies with a group design were marked as minimally meeting this domain (+). Those that also received an affirmative response for either question three or four exceeded that minimum (++).

Diagnostic approach. We expected studies to accurately characterize participants, and in particular to ensure that study participants purported to be on the autism spectrum had been diagnosed as such using a validated approach. We developed the hierarchy of diagnostic approaches below to capture the method used; Table 7 includes more information about each approach.

Table 7. Overview of diagnostic tools used in quality scoring hierarchy.


  1. Was a valid diagnostic approach for ASDs used within the study, or were referred participants diagnosed using a valid approach?
    A. A clinical diagnosis based on the DSM-IV, in addition to the ADI-R and ADOS assessments.
    B. A clinical diagnosis based on the DSM-IV, in addition to either the ADI-R or ADOS assessment.
    C. A combination of a DSM-IV clinical diagnosis with one other assessment tool from Table 7; or the ADOS assessment in combination with one other assessment tool from Table 7.
    D. Either a clinical DSM-IV-based diagnosis alone or the ADOS assessment alone.
    E. Neither a clinical DSM-IV-based diagnosis nor the ADOS assessment.
Table 8. Quality scoring algorithm.


Scoring: We classified diagnostic approaches A and B as gold standard (++), C and D as adequate (+), and E as unacceptable (−).

Participant ascertainment. The means by which participants enter the study cohort and are included in the analysis should be clearly described so that the reader can gauge the applicability of the research to other populations and identify selection and attrition bias. In this literature, it is important to understand the population in terms of characteristics commonly associated with outcomes, such as IQ, language, and cognitive ability. We used four questions to assess participant ascertainment, including who was included in the analysis:

  1. Was the sample clearly characterized (e.g., information provided to characterize participants in terms of impairments associated with their ASDs, such as cognitive or developmental level)?
  2. Were inclusion and exclusion criteria clearly stated?
  3. Do the authors report attrition?
  4. Were characteristics of the drop-out group evaluated for differences with the participant group as a whole?

Scoring: Studies minimally had to have an affirmative answer for questions one or two of this domain to be adequate (+). Affirmative responses on questions three or four were considered superior (++).

Intervention characteristics. Sufficient detail should be provided on the intervention so that the reader can fully understand the treatment and so that the research is potentially reproducible. This includes information on dosage, formulation, timing, duration, intensity, and other qualities of the intervention. For behavioral treatments, there should be some assurance that the treatment providers stayed true to the treatment process (fidelity); for medical treatments, there should be some assurance that participants adhered to their medication or that adherence was accounted for. In addition, because other treatments occurring simultaneously with the treatment under study could have a substantial impact on outcomes, it is important that authors gather data on treatments being obtained by their participants outside of the study. We used the following questions (questions two and three apply to non-medical and medical interventions, respectively) to obtain quality information in this domain, and allowed for the intervention description to be provided in another, referenced paper:

  1. Was the intervention fully described?
  2. Was treatment fidelity monitored in a systematic way? (for non-medical interventions)
  3. Did the authors measure and report adherence to the intended treatment process? (for medical interventions)
  4. Did the authors report differences in or hold steady all concomitant interventions?

Scoring: Authors needed to fully describe the intervention for the study to be awarded one point (+), and studies were given an additional point (++) if they also reported on or held steady concomitant interventions and monitored either fidelity or adherence.

Outcomes measurement. The ASD literature reviewed for this report included more than 100 outcome measures. To understand the meaning of the results at hand, readers need to be confident that the measure validly assessed the intended target behavior or symptom. It is also important that authors specify a priori their outcome of primary interest, as the rest of the study, including sample size, should derive from the intent to measure this outcome. Finally, in measuring outcomes, the individual responsible for coding or measuring effect should be blinded to which intervention the participant received. We attempted to use three questions for this domain, but dropped one asking whether primary outcomes were pre-determined, as it was almost uniformly impossible to tell whether authors had a “called shot” or a priori primary outcome, or which of several outcomes was the primary one. We were left with two questions:

  1. Did outcome measures demonstrate adequate reliability and validity (including inter-observer reliability for behavior observation coding)?
  2. Were outcomes coded and assessed by individuals blinded to the intervention status of the participants?

Scoring: To meet the requirement for an adequate score on outcomes measurement (+), studies were required to have an affirmative answer to both questions.

Statistical analysis. Studies could either have appropriate or inappropriate analysis. We used a series of questions to guide the determination:

  1. For RCTs, was there an intent-to-treat analysis?
  2. For negative studies, was a power calculation provided?
  3. For observational studies, were potential confounders and effect measure modifiers captured?
  4. For observational studies, were potential confounders and effect measure modifiers handled appropriately?

Confounders are variables that are associated with both the intervention and the outcome and that, left uncontrolled, distort the apparent relationship of the intervention to the outcome; these are variables that we would control for in analysis. Effect measure modifiers are variables on which we would stratify, in that the relationship between the intervention and outcome is fundamentally different in different strata of the effect modifier. Observational research should include an assessment of potential confounders and modifiers, and if they are observed, the analysis should control for or stratify on them. Other considerations included: Was candidate variable selection discussed or noted? Was the model-building approach described? Were variables not under study that could have altered the outcome handled appropriately?

Scoring: Studies needed a yes or not applicable (NA) on each of the analysis questions to receive a point (+) for analysis.

Scores were calculated first by domain and then summed and weighted as described in Table 8 to determine overall study quality (internal validity).
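
The domain-level rules above reduce to lookups from yes/no answers to ratings. The sketch below encodes our reading of them; the final combination into good/fair/poor follows the weighting in Table 8 and is not reproduced here. All function and parameter names are illustrative:

```python
def score_study_design(has_comparison_group, randomized_correctly, appropriate_nonrandom_comparison):
    # A group design is the minimum (+); correct randomization or an
    # appropriate non-random comparison group earns ++.
    if not has_comparison_group:
        return "-"
    return "++" if (randomized_correctly or appropriate_nonrandom_comparison) else "+"

def score_diagnostic_approach(approach):
    # Approaches A and B are gold standard (++), C and D adequate (+),
    # E unacceptable (-); see Table 7 for the tools involved.
    return {"A": "++", "B": "++", "C": "+", "D": "+", "E": "-"}[approach]

def score_ascertainment(sample_characterized, criteria_stated, attrition_reported, dropouts_compared):
    # Question one or two must be affirmative for adequacy (+);
    # reporting attrition or comparing drop-outs is superior (++).
    if not (sample_characterized or criteria_stated):
        return "-"
    return "++" if (attrition_reported or dropouts_compared) else "+"

def score_intervention(fully_described, fidelity_or_adherence_monitored, concomitant_reported_or_held):
    # A full description earns one point (+); monitoring fidelity/adherence
    # AND addressing concomitant interventions earns a second (++).
    if not fully_described:
        return "-"
    return "++" if (fidelity_or_adherence_monitored and concomitant_reported_or_held) else "+"

def score_outcomes(measures_valid_and_reliable, assessors_blinded):
    # Both questions must be affirmative for an adequate score (+).
    return "+" if (measures_valid_and_reliable and assessors_blinded) else "-"

def score_analysis(answers):
    # Each applicable analysis question must be yes or not applicable.
    return "+" if all(a in ("yes", "na") for a in answers) else "-"
```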

Applicability. Finally, it is important to consider the ability of the outcomes observed to apply both to other populations and to other settings (especially for those therapies that take place within a clinical/treatment setting but are hoped to change behavior overall). Our assessment of applicability took place in three steps. First, we determined the population, intervention, comparator, and setting (PICOS) in each study and developed an overview of these elements for each intervention category (Appendix I). Second, we reviewed potential modifiers of effect of treatment to identify subgroups for which treatments may be effective, and finally, we answered the following three questions:

  1. Were outcomes measured in at least one context outside of the treatment setting?
  2. Were outcomes measured in natural environments to assess generalization (i.e., was an assessment conducted in the home, school, or community, a setting a child typically goes to in an ordinary week)?
  3. Were followup measures of outcome conducted to assess maintenance of skills at least 3 months after the end of treatment?

These ratings of applicability do not factor into a study’s overall quality score (good, fair, or poor), nor are they part of strength of evidence. Rather they are presented separately and are discussed in Chapter 4.

Strength of Available Evidence

The assessment of the literature is done by considering both the observed effectiveness of interventions and the confidence that we have in the stability of those effects in the face of future research. The degree of confidence that the observed effect of an intervention is unlikely to change is presented as strength of evidence, and it can be regarded as insufficient, low, moderate, or high. Strength of evidence describes the adequacy of the current research, both in terms of quantity and quality, as well as the degree to which the entire body of current research provides a consistent and precise estimate of effect. Interventions that have demonstrated benefit in a small number of studies but have not yet been replicated using the most rigorous study designs will therefore have insufficient or low strength of evidence to describe the body of research. Future research may find that the intervention is either effective or ineffective.

Methods for applying strength of evidence assessments are established in the Evidence-based Practice Centers’ Methods Guide for Effectiveness and Comparative Effectiveness Reviews99 and are based on consideration of four domains: risk of bias, consistency in direction of the effect, directness in measuring intended outcomes, and precision of effect. Strength of evidence is assessed separately for major intervention-outcome pairs. We also required at least 3 fair studies to be available to assign a low strength of evidence rather than considering it to be insufficient. For determining the strength of evidence for effectiveness outcomes, we only assessed the body of literature deriving from studies that included comparison groups. We required at least one good study for moderate strength of evidence and two good studies for high strength of evidence. In addition, to be considered “moderate” or higher, intervention-outcome pairs needed a positive response on two out of the three domains other than risk of bias.

For determining the strength of evidence related to harms, we also considered data from case series. Once we had established the maximum strength of evidence possible based upon these criteria, we assessed the number of studies and range of study designs for a given intervention-outcome pair, and downgraded the rating when the cumulative evidence was not sufficient to justify the higher rating. The possible grades were:

  • High: High confidence that the evidence reflects the true effect. Further research is unlikely to change estimates.
  • Moderate: Moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate.
  • Low: Low confidence that the evidence reflects the true effect. Further research is likely to change confidence in the estimate of effect and is also likely to change the estimate.
  • Insufficient: Evidence is either unavailable or does not permit a conclusion.
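
A simplified sketch of these grading rules, under our reading (it reduces the consistency, directness, and precision judgments to a simple count of positive domains and is not the EPC algorithm verbatim):

```python
def grade_strength_of_evidence(n_good_studies, n_fair_studies, positive_domains):
    """Assign a strength-of-evidence grade for an intervention-outcome pair.

    positive_domains: count (0-3) of positive judgments among consistency,
    directness, and precision (the domains other than risk of bias).
    """
    # "Moderate" or higher requires positive responses on two of the three
    # domains other than risk of bias, plus at least one good study
    # (two good studies for "high").
    if n_good_studies >= 2 and positive_domains >= 2:
        return "high"
    if n_good_studies >= 1 and positive_domains >= 2:
        return "moderate"
    # At least three fair-or-better studies are needed to assign "low"
    # rather than "insufficient".
    if n_good_studies + n_fair_studies >= 3:
        return "low"
    return "insufficient"
```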

