
Chapter 52: Criteria for Determining Disability in Speech-Language Disorders: Evidence Report/Technology Assessment Number 52

Prepared for:
Agency for Healthcare Research and Quality
U.S. Department of Health and Human Services
2101 East Jefferson Street
Rockville, MD 20852

http://www.ahrq.gov/

Contract No. 290-97-0011

Prepared by:
Research Triangle Institute Evidence-based Practice Center at
the University of North Carolina at Chapel Hill
Investigators
Andrea K. Biddle, Ph.D., M.P.H.
Linda R. Watson, Ed.D.
Celia R. Hooper, Ph.D.
Kathleen N. Lohr, Ph.D.
Sonya F. Sutton, B.S.P.H.

AHRQ Publication No. 02-E010

January 2002

This report may be used, in whole or in part, as the basis for development of clinical practice guidelines and other quality enhancement tools, or a basis for reimbursement and coverage policies. AHRQ or U.S. Department of Health and Human Services endorsement of such derivative products may not be stated or implied.

AHRQ is the lead Federal agency charged with supporting research designed to improve the quality of health care, reduce its cost, address patient safety and medical errors, and broaden access to essential services. AHRQ sponsors and conducts research that provides evidence-based information on health care outcomes; quality; and cost, use, and access. The information helps health care decisionmakers -- patients and clinicians, health system leaders, and policymakers -- make more informed decisions and improve the quality of health care services.

This document is in the public domain and may be used and reprinted without permission except those copyrighted materials noted for which further reproduction is prohibited without the specific permission of copyright holders.

Suggested Citation

Biddle A, Watson L, Hooper C, et al. Criteria for Determining Disability in Speech-Language Disorders. Evidence Report/Technology Assessment No. 52 (Prepared by the University of North Carolina Evidence-based Practice Center under Contract No. 290-97-0011). AHRQ Publication No. 02-E010. Rockville, MD: Agency for Healthcare Research and Quality. January 2002.

Preface

The Agency for Healthcare Research and Quality (AHRQ), through its Evidence-Based Practice Centers (EPCs), sponsors the development of evidence reports and technology assessments to assist public- and private-sector organizations in their efforts to improve the quality of health care in the United States. The reports and assessments provide organizations with comprehensive, science-based information on common, costly medical conditions and new health care technologies. The EPCs systematically review the relevant scientific literature on topics assigned to them by AHRQ and conduct additional analyses when appropriate prior to developing their reports and assessments.

To bring the broadest range of experts into the development of evidence reports and health technology assessments, AHRQ encourages the EPCs to form partnerships and enter into collaborations with other medical and research organizations. The EPCs work with these partner organizations to ensure that the evidence reports and technology assessments they produce will become building blocks for health care quality improvement projects throughout the Nation. The reports undergo peer review prior to their release.

AHRQ expects that the EPC evidence reports and technology assessments will inform individual health plans, providers, and purchasers as well as the health care system as a whole by providing important information to help improve health care quality.

We welcome written comments on this evidence report. They may be sent to: Acting Director, Center for Practice and Technology Assessment, Agency for Healthcare Research and Quality, 6010 Executive Blvd., Suite 300, Rockville, MD 20852.

John M. Eisenberg, M.D.
Director
Agency for Healthcare Research and Quality

Robert Graham, M.D.
Director, Center for Practice and Technology Assessment
Agency for Healthcare Research and Quality

The authors of this report are responsible for its content. Statements in the report should not be construed as endorsement by the Agency for Healthcare Research and Quality or the U.S. Department of Health and Human Services of a particular drug, device, test, treatment, or other clinical service.

Structured Abstract

Objectives

Approximately 42 million Americans have some type of communication disorder, costing the nation $30 billion to $154 billion for lost productivity, special education, and medical care annually. The quality of the numerous evaluation procedures and instruments for clinical decisionmaking about language, speech, or voice disorders influences decisions about access to services and funding (e.g., special education services, Social Security disability income). The RTI-University of North Carolina at Chapel Hill Evidence-based Practice Center conducted a systematic review of the literature to address two key questions about evaluating and diagnosing speech and language disorders in adults and children of particular concern to the Social Security Administration in making disability eligibility determinations: (1) What instruments have demonstrated reliability, validity, and normative data? (2) Do these instruments have predictive validity for an individual's communicative impairment, performance, or both?

Search Strategy

We conducted detailed searches of the English-language literature from 1966 to October 2000 using the MEDLINE, CINAHL, PsycLIT®, ERIC, Health and Psychosocial Instruments, and Cochrane Collaboration databases.

Selection Criteria

We included all English-language research on 18 instruments for children and adults in which investigators evaluated the instrument's reliability, validity, or ability to predict future communicative impairment or functioning. Excluded were articles reporting the efficacy or effectiveness of specific interventions that did not provide information on the key questions, articles providing normative data from non-US populations, and all gray literature (i.e., literature not from peer-reviewed sources) except instrument manuals. An independent expert panel knowledgeable in language, speech, or voice disorders had identified the instruments we reviewed.

Data Collection and Analysis

We selected studies from among 1,238 citations using a process of duplicate, independent review of titles, abstracts, and, where necessary, full papers. We abstracted data on 92 articles or manuals, using single abstraction with subsequent review by clinical and methodological experts; reviewers also completed quality rating forms. Criteria used to evaluate reliability, validity, and other data reflect widely accepted or known standards for the psychometric properties of such instruments.

Main Results

Among language disorder instruments, one (of three) for adults and four (of eight) for children met or nearly met our evaluation criteria for reliability and validity; two child-specific instruments provided data for subpopulations. Although these five instruments had norms, only the child-specific instruments provided nationally representative data. Two (of three) instruments for voice disorders met evaluation criteria; speech disorder instruments did not. Only four studies gave information on prediction of future communicative functioning and impairment.

Conclusions

Reliability and validity data for the majority of instruments rarely came from peer-reviewed literature; instrument manuals yielded most such data. Some manuals provided comprehensive data from well-conducted standardization studies; most did not. Because normative data were usually not derived from nationally representative samples, generalizing results beyond the populations studied was difficult. Sample size and representativeness problems limited the predictive validity studies. Overall, evidence about diagnostic or predictive properties of instruments addressing language, speech, and voice disorders is weak and incomplete at this time. The sparse evidence base suggests a substantial methodologic, clinical, and policymaking research agenda.

Summary

Overview

Approximately 42 million people (1 in 6) in the United States have some type of communication disorder. Of these, 28 million have communication disorders associated with hearing loss, and 14 million have disorders of speech, voice, and/or language not associated with hearing loss. The personal and societal costs of these disorders are high. On a personal level, such disorders may affect nearly every aspect of daily life. Estimates of annual societal costs in the United States range from $30 billion to $154 billion in lost productivity, special education, and medical costs.

Over the last several decades, researchers and clinicians have developed a vast array of assessment instruments for speech, voice, and language; one source reviewing commercially available assessment instruments includes more than 140 tools in its most recent edition. Important clinical decisions follow from the assessment of a person with a communication disorder. These clinical decisions affect an individual's access to services and funding (e.g., eligibility for special education services, third-party payer coverage of treatment, and Social Security disability income).

Thus, the quality of the evaluation procedures on which such decisions are based is an important issue for individuals with a communication disorder, the clinicians involved in their evaluation and treatment, and the policymakers with fiscal responsibilities for services to individuals with these disorders. This evidence report, prepared by staff of the RTI-University of North Carolina at Chapel Hill Evidence-based Practice Center (RTI-UNC EPC) is directed to audiences who must grapple with this set of issues.

Reporting the Evidence

The clinical questions in this report were developed in conjunction with the Social Security Administration (SSA) to assist the agency in reviewing its criteria for determining disability in individuals with speech or language disorders, or both. Currently, disability determination depends on the functional limitations individuals experience, either with respect to employment in adults or with respect to the major life activities of children or adolescents (for example, school or play).

Therefore, in evaluations of individuals with speech and language disorders, the SSA is concerned with the concurrent relationship between the degree of impairment as measured by the assessment instrument and functional limitations associated with the speech or language impairment. Another commonality in the definitions of disability in children and adults is that the disability must be expected to last for at least 12 months or to result in death during that period. This criterion leads to a second important concern for the SSA, which is to know what evidence is available for various speech and language assessment instruments regarding their predictive power for future functioning of an individual. The SSA is interested in children and adults who (1) are English-speaking and have normal hearing, with or without normal cognition; (2) are non-English-speaking and have normal hearing, with or without normal cognition; (3) are mentally retarded; (4) have learning disorders; and (5) are hard of hearing.

Based on concerns related to the criteria and process for determining disability in children and adults, the SSA outlined two key questions as the basis for this report. First, do the 18 reviewed instruments have demonstrated reliability, validity, and normative data? Second, are there instruments with demonstrated predictive validity for the individual's communicative impairment and performance?

Methodology

Search Process and Inclusion Criteria

The task of synthesizing the available evidence on all speech and language evaluation instruments was clearly too large an undertaking to complete within the scope of this project. Thus, EPC staff had to select and prioritize instruments in such a way as to address the critical informational needs of the SSA while also limiting the scope to fall within the contractual boundaries of the project. To do this, we assembled a panel of 10 national experts, our Technical Expert Advisory Group (TEAG). They, along with Agency for Healthcare Research and Quality (AHRQ) and SSA staff, identified 19 instruments for literature review and evidence analysis: three each for adult language, adult speech, child speech, and voice, and eight for child language disorders. One speech instrument can be used with both adults and children and thus was counted twice. We later excluded one instrument because it was not a single instrument but instead was an approach to conducting more comprehensive clinical analysis of phonological patterns for which standard "diagnostic test characteristics" would be hard to determine.

The RTI-UNC EPC review team conducted detailed searches of the relevant English-language literature from 1966 (or the initiation of the specific electronic database) to October 2000 using the MEDLINE®, CINAHL, PsycLIT®, ERIC, Health and Psychosocial Instruments (HAPI), and Cochrane Collaboration databases. We initially excluded all gray literature. After reviewing abstracts for eligibility, however, we recognized that, for many instruments, data on reliability and validity could be found only in the instrument manuals. Thus, we expanded efforts to include instrument manuals in the review. We also examined reference lists of all included articles and instrument manuals to identify additional studies.

The EPC team applied a series of inclusion and exclusion criteria to the literature searches. Essentially, we included all English-language research on the selected instruments in children and adults (ages 18 through 62) in which the study evaluated the instrument's reliability, validity, or ability to predict future communicative impairment and/or functioning (i.e., predictive validity). Articles reporting the efficacy or effectiveness of speech or language therapy that did not provide information relevant to the key questions were excluded. Because of the need to address issues facing the SSA in establishing disability criteria in the United States, we excluded articles providing normative data from populations other than the United States.

The EPC team selected studies for inclusion from among 1,238 citations using a process of duplicate but independent review of titles, abstracts, and, where necessary, full papers. Discussion leading to consensus was used to resolve disagreements. The number of citations reviewed ranged from three, for the Dysarthria Examination Battery (DEB) and Voice Handicap Index (VHI), to 256, for the Test of Language Development (TOLD).

The team abstracted data, using single abstraction with subsequent review by clinical and methodological experts, from 92 articles whose abstracts met inclusion criteria. Two reviewers with expertise in quantitative psychology and experience in the validation and standardization of educational tests abstracted the data. During the data abstraction phase, we eliminated 53 articles because they did not meet inclusion criteria or did not address the version of the instrument selected by TEAG members.

The EPC study director and clinical experts completed a quality rating for each article and manual. The quality rating scales evaluated research design and conduct, measurement of reliability and validity, development of instrument norms, justifications for conclusions, and external validity concerns. Six additional items evaluated aspects of instrument development or revision for the instrument manuals.

The team compiled the data into a series of five evidence tables for each instrument. The first of these tables provides information on the study design and conduct and the quality scores assigned by the methodologist and the expert clinicians. The subsequent four tables describe the reliability, validity, predictive validity for future communicative functioning, and available normative data found in the reviewed articles and manuals.

Subsequently, the team graded the evidence summarized in the tables, assessing whether the evidence met thresholds for acceptable reliability, validity, and availability of normative data. Where relevant, we used classic criteria for clinical decisionmaking about individuals, not groups of subjects. The criteria employed were:

  • Reliability -- the criterion for reliability is "strictly" met if the following three conditions are all met:

    • Internal consistency reliability, measured using either Cronbach's coefficient alpha or Kuder-Richardson statistics (K-R 20), is greater than or equal to 0.90;

    • Test-retest/intra-rater reliability is greater than or equal to 0.90 if measured using a correlation coefficient, or greater than or equal to 0.80 if measured using Cohen's Kappa; and

    • Inter-rater reliability is greater than or equal to 0.90 if measured using a correlation coefficient, or greater than 0.80 if measured using Cohen's Kappa.

  • Validity -- the criterion for validity is met if the following conditions are all met:

    • Instrument developers examine relationships between subtests, composite scores, and total scores, establishing hypotheses a priori for these relationships and for patterns of scores for individuals belonging to various groups of import;

    • These relationships are all statistically significant at p < 0.05; and

    • In the case of correlation coefficients, the magnitude of the relationship is at least 0.30, thus providing evidence of a moderate correlation.

  • Normative Data -- the criterion for normative data is met if the following conditions are all met:

    • Data are available for the population targeted by the instrument;

    • An adequate sample size is used (i.e., at least 100 per group); and

    • Evidence is provided on how well the sample represents the population.
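The validity and normative-data criteria listed above amount to a set of threshold checks that must all pass. The following is a minimal sketch of that logic, not the procedure the review team actually used; the `Relationship` record, the function names, and the choice to treat missing evidence as "not met" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Relationship:
    """One reported relationship (hypothetical record layout, not from the report)."""
    hypothesized_a_priori: bool          # was the hypothesis stated in advance?
    p_value: float                       # significance of the reported relationship
    correlation: Optional[float] = None  # None if the statistic is not a correlation


def meets_validity(relationships: List[Relationship]) -> bool:
    """True only if every relationship satisfies all three validity conditions."""
    if not relationships:  # assumption: no reported evidence counts as "not met"
        return False
    for rel in relationships:
        if not rel.hypothesized_a_priori:
            return False
        if rel.p_value >= 0.05:  # must be significant at p < 0.05
            return False
        # Magnitude of at least 0.30 applies only to correlation coefficients.
        if rel.correlation is not None and abs(rel.correlation) < 0.30:
            return False
    return True


def meets_normative(targets_population: bool,
                    group_sizes: List[int],
                    representativeness_documented: bool) -> bool:
    """True only if all three normative-data conditions hold."""
    # "At least 100 per group" is checked against the smallest group.
    adequate_size = bool(group_sizes) and min(group_sizes) >= 100
    return targets_population and adequate_size and representativeness_documented


example = [Relationship(True, 0.01, 0.45), Relationship(True, 0.03)]
print(meets_validity(example))                 # True
print(meets_normative(True, [120, 95], True))  # False: one group has fewer than 100
```

Checking the smallest group size, rather than the total sample, reflects the "per group" wording of the criterion.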

Some might reasonably argue that we set the criterion for internal consistency reliability too high given the complexity of speech and language functioning and disorders. Additionally, the variability in daily performance that arises from these different speech and language disorders suggests that our criterion for test-retest reliability or intra-rater reliability was also set too high. Thus, we defined a "relaxed" criterion, which differs from the strict criterion in that internal consistency reliability may be as low as 0.80 and/or test-retest/intra-rater reliability may be as low as 0.80 (correlation) or 0.70 (Cohen's Kappa). The relaxed criterion is at a level suitable for having confidence in group, rather than individual, comparisons.
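The strict and relaxed reliability criteria can be sketched as a small grading function. This is an illustrative reading of the thresholds stated in this section, not the review team's own tooling; the `Coefficient` record, the function name, and the decision to treat any missing statistic as "not met" are assumptions. The inter-rater threshold for Cohen's Kappa is coded as strictly greater than 0.80, following the wording of the criterion, and is left unchanged under the relaxed criterion because the text relaxes only internal consistency and test-retest/intra-rater reliability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Coefficient:
    """A reported reliability statistic (hypothetical record layout)."""
    value: float
    kind: str  # "correlation" or "kappa"


def grade_reliability(internal_consistency: Optional[float],
                      test_retest: Optional[Coefficient],
                      inter_rater: Optional[Coefficient]) -> str:
    """Return "strict", "relaxed", or "not met" for one instrument."""
    # Assumption: all three kinds of reliability must be reported.
    if internal_consistency is None or test_retest is None or inter_rater is None:
        return "not met"

    # Inter-rater reliability has the same threshold under both criteria.
    if inter_rater.kind == "kappa":
        inter_ok = inter_rater.value > 0.80   # "greater than 0.80" as written
    else:
        inter_ok = inter_rater.value >= 0.90
    if not inter_ok:
        return "not met"

    def retest_ok(corr_min: float, kappa_min: float) -> bool:
        limit = kappa_min if test_retest.kind == "kappa" else corr_min
        return test_retest.value >= limit

    if internal_consistency >= 0.90 and retest_ok(0.90, 0.80):
        return "strict"
    if internal_consistency >= 0.80 and retest_ok(0.80, 0.70):
        return "relaxed"
    return "not met"


print(grade_reliability(0.92, Coefficient(0.91, "correlation"),
                        Coefficient(0.93, "correlation")))  # strict
print(grade_reliability(0.85, Coefficient(0.75, "kappa"),
                        Coefficient(0.81, "kappa")))        # relaxed
```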

After grading the psychometric properties of the individual instruments, we graded the strength of the overall body of evidence for groups of instruments identified by age group and disorder. We graded instrument manuals and peer-reviewed literature separately, employing the following definitions for both.

  • Acceptable: research or analyses were well conducted, had representative samples of reasonable size, and met our psychometric evaluation criteria discussed earlier.

  • Unacceptable: studies were poorly conducted, used small or nonrepresentative samples, or had results that did not meet or only partially met the psychometric criteria.

Findings

Reliability, Validity, and Availability of Normative Data

The EPC team evaluated the strength of evidence describing the reliability, validity, and availability of normative data separately for instruments assessing adult language, child language, adult speech, child speech, and voice disorders.

Adult Language Instruments

The Porch Index of Communicative Ability (PICA) met our relaxed standards of evidence for both reliability and validity, as did the original version of the Western Aphasia Battery (WAB); however, one small study suggested that the WAB might not consistently classify patients with aphasia. The Boston Diagnostic Aphasia Examination, 2nd Edition (BDAE-2) met neither the reliability nor validity criterion.

Although normative data are available for two of the instruments, these data were derived from individuals treated at single institutions. Information was insufficient to assess whether they are representative of typical aphasics.

Child Language Instruments

Three tests -- the Clinical Evaluation of Language Fundamentals, 3rd Edition, Spanish Edition (CELF-3Sp); the Test of Language Development, Primary, 3rd Edition (TOLD-P:3); and the Test of Language Development, Intermediate, 3rd Edition (TOLD-I:3) -- met the standards we established for reliability, validity, and the availability of representative normative data.

The Preschool Language Scale, 3rd Edition (PLS-3) met the relaxed reliability criterion for all age groups except children between 0 and 8 months of age; the Clinical Evaluation of Language Fundamentals, 3rd Edition (CELF-3) met the relaxed criterion for total score but not for composite scores.

With the exception of the Spanish version of the PLS-3, all instruments provided normative data derived from nationally representative populations. The CELF-3 (Spanish version) derived norms representative of the US Hispanic population.

Only the developers of the TOLD-P:3 and TOLD-I:3 provided evidence of the reliability and validity for use with four of the five populations specifically targeted by the SSA.

Adult Speech Instruments

None of the adult speech disorder instruments met the standards of evidence we established for both reliability and validity. The Stuttering Severity Instrument for Children and Adults, 3rd Edition (SSI-3), however, met the validity criterion.

No instrument met normative data standards. Although normative data were available for the SSI-3 and the Assessment of Intelligibility in Dysarthric Adults (AIDS), these data had been derived from individuals treated at single institutions. Instrument developers provided insufficient information to assess whether these patients were representative of adults with speech disorders.

Child Speech Instruments

Neither the Goldman-Fristoe Test of Articulation, 2nd Edition (GFTA-2) nor the SSI-3 met our relaxed criteria for reliability and validity. The GFTA-2 met our relaxed criterion for internal consistency reliability. Developers of both instruments employed nonstandard statistical methods to test other forms of reliability.

GFTA-2 provided normative data derived from nationally representative populations; the SSI-3 also provided normative data but gave no information on its representativeness.

Voice Instruments

Both the Voice Handicap Index (VHI) and the Kay Elemetrics Multi-Dimensional Voice Program (MDVP) met our criteria for reliability, validity, and availability of normative data.

Prediction of Future Communicative Functioning

We found only four studies providing evidence about prediction of future functioning; thus, we consider the evidence incomplete on this point. Of the 18 instruments we reviewed, information on predictive validity was available for only four -- one for adult language disorders, two for child language disorders (but not for versions directly reviewed in this report), and one for child speech disorders. None of the instruments we reviewed for either adult speech disorders or voice disorders had evidence of predictive validity.

Future Research

Further research is needed to evaluate and demonstrate the reliability, validity, and availability of normative data for instruments used to assess speech and language functioning and disorders. Instrument developers must be encouraged to document all types of instrument reliability (internal consistency, test-retest or intra-rater, and inter-rater reliability) and validity (content, construct, and concurrent validity) and to use currently accepted statistical procedures for psychometric analyses. Normative samples need to be representative of the population(s) of interest and of sufficient size that instruments can be shown to provide valid, interpretable results.

Funding agencies can facilitate this process by providing resources for the development and validation of new and existing instruments. Likewise, journal editors can help by encouraging the submission of reports on instrument reliability and validity, identifying peer reviewers who are qualified to evaluate the quality and rigor of these types of reports, and then publishing such data in their journals.

With the increasing cultural, linguistic, and racial diversity of the US population, the applicability of assessment instruments to individuals who are members of different subpopulations is of crucial importance to clinical diagnosis and the process of disability determination. Despite the existence of a large number of speech and language assessment instruments, we still lack appropriate instruments for reliably and validly assessing speech and language in many subgroups defined in terms of language, dialect, or cultural differences. Thus, future research funding and priorities should be directed at addressing these serious deficiencies. Funding sources should encourage research teams that represent collaborations among professionals with expertise in speech and language disorders, cultural experts for the demographic subpopulations of interest, professionals with expertise in disorders that often co-occur with speech and language impairment, and psychometric experts.

In addition to demographic subpopulations, research is needed on the applicability of speech and language assessment instruments for assessment of individuals with different disorders, such as severe physical impairment, mental retardation, learning disorders, and hearing impairment. Including representative numbers of members of these subgroups in normative samples during instrument standardization is important, but improving the evidence base requires analyses examining reliability and validity of instruments for subpopulations, not just for the total normative sample. Researchers and instrument developers should be encouraged to fill this gap.

Further, large-scale research also is needed on the ability of speech and language assessment instruments to predict future performance. Such investigations should not be limited to the predictive value of instruments in assessing specific intervention programs or in predicting future performance of a restricted subgroup. Rather, in terms of concern about disability, prediction of future test performance and future adaptive performance in everyday life is also critical. Such a "real world" research agenda would not only assist the SSA in decisions about disability but also contribute to the "ecological validity" of all speech and language assessments. We need both more instruments providing direct measurement of activity limitations and participation restrictions and more research demonstrating the relationship between speech and language impairment and activity limitations or participation restrictions.

Information on costs and burden to patients and to those in health care delivery settings should also be assembled, as it will likely be valuable in helping SSA or clinicians to select among otherwise seemingly similar instruments. A related area for future research is to compare the relative sensitivity and specificity of different approaches to disability determination for different types and degrees of speech and language impairment and to determine when the relative costs and benefits justify the addition of standardized instruments to the assessment process rather than relying solely on clinical judgments.

Important future research in this area includes investigation of the societal costs of speech and language disorders and the societal benefits of treating them. A good deal of work is needed simply on amassing data on costs of illness and costs of treatment. Combined with better information on efficacy and effectiveness of treatment, as called for above, such information would help researchers, clinicians, and policymakers better understand the cost-effectiveness of alternative therapeutic modalities.

Virtually no literature is available on the adverse effects or harms of diagnostic testing or disability evaluation. We urge that researchers take a broader perspective on the investigation of speech and language instruments, so as to shed some light on the likelihood that adults or children may be mislabeled (in both positive and negative ways) and on the consequences of such labeling.

Finally, we see a rich portfolio of research concerning appropriate ways to manage speech, language, or voice disorders in both adults and children. A necessary part of such investigations involves tracking patients' progress over time, and obviously the types of instruments reviewed here could play a part in such outcomes assessments. However, the deficiencies in many of these popular and well-known instruments need to be addressed before they can be used with confidence in treatment trials or studies. Apart from the basic measurement issues, methodological work is needed on the responsiveness of these instruments (that is, on their sensitivity to change and on the calculation of appropriate effect sizes that reflect change over time for individuals and groups). One strategy for those engaging in or supporting research on the management of patients with speech and language disorders is to build solid methodological research directly into treatment and rehabilitation studies, thereby strengthening both the given studies and the measurement field as a whole.

Chapter 1. Introduction

The purpose of this report is to assist the Social Security Administration (SSA) in reviewing its criteria for determining disability in individuals with speech disorders, language disorders, or both. The statutory definition of disability in adults is "an inability to engage in any substantial gainful activity by reason of any medically determinable physical or mental impairment which can be expected to result in death or which has lasted or can be expected to last for a continuous period of not less than twelve months."1 For children and adolescents, the definition is "(i) An individual under the age of 18 shall be considered disabled for the purposes of this title if that individual has a medically determinable physical or mental impairment, which results in marked and severe functional limitations, and which can be expected to result in death or which has lasted or can be expected to last for a continuous period of not less than 12 months. (ii) Notwithstanding clause (i), no individual under the age of 18 who engages in substantial gainful activity may be considered to be disabled."1

These definitions make clear that across the age span considered for disability claims (i.e., birth to 62 years of age), disability determination depends on the functional limitations that an individual experiences, with respect to either employment in adults or major life activities of children or adolescents (for example, school or play). Therefore, in evaluations of individuals with speech and language disorders, the SSA is concerned with the concurrent relationship between the degree of impairment as measured by the assessment instrument and functional limitations associated with the speech or language impairment.

Table 1. Key Questions

Key Question and Core Elements

  1. What evaluation procedures for child and adult speech (voice, articulation/intelligibility, fluency) and language disorders have been demonstrated to have the salient characteristics of a good diagnostic tool (e.g., reliability, validity, appropriate normative data, responsiveness) for individuals who are/have:

    • English-speaking, have normal hearing, with or without normal cognition?

    • Non-English-speaking, have normal hearing, with or without normal cognition?

    • Mentally retarded?

    • Learning disorders?

    • Hearing impaired (i.e., hard of hearing)?

  2. Are there evaluation procedures that have been demonstrated to have predictive validity for the individual's communicative impairment, performance, or both?

Another commonality in the definitions of disability in children and adults is that the disability must be expected to last for at least 12 months or to result in death during that period. This criterion leads to a second important concern for the SSA, which is to know what evidence is available for various speech and language assessment instruments regarding their predictive power for future functioning of an individual. Based on concerns related to the criteria and process for determining disability in children and adults, the SSA nominated two key questions as the basis for this report. Table 1 provides full specification of the key questions.

Disability Associated with Speech and Language Disorders

Epidemiology and Costs of Speech and Language Disorders

According to the National Institute on Deafness and Other Communication Disorders (NIDCD), approximately 42 million people (1 in 6) in the United States have some type of communication disorder.2 Of these, 28 million have communication disorders associated with hearing loss, and 14 million have disorders of speech, voice, and/or language not associated with hearing loss.

The personal and societal costs of these disorders are high. On a personal level, such disorders may affect nearly every aspect of daily life. Estimates of annual societal costs in the United States range from $30 billion3 to $154 billion4 in lost productivity, special education, and medical care.

Speech Disorders

A speech disorder is a disorder affecting the articulation of speech sounds, the fluency with which speech is produced, or the quality of the voice. Articulation disorders (also called phonological disorders) include motor speech disorders and functional articulation disorders. Motor speech disorders result from damage to the central or peripheral nervous system. Damage may occur as the result of strokes, traumatic brain injury, or neurogenic diseases including Parkinson's disease, Huntington's disease, and amyotrophic lateral sclerosis; among children, the problems can arise from any of a range of prenatal, perinatal, and postnatal conditions, particularly those resulting in cerebral palsy.

Functional articulation disorders are those that either have no known cause or result from causes other than known neurological insults or physical abnormalities. In the majority of cases in children, articulation disorders fall into this category.5 They may stem from problems with the motoric component of speech production or from an internal representation of the phonological rule system of the target language that is immature or disordered. Among preschool and school-age children, articulation disorders are the most prevalent communication disorders, affecting approximately 10 percent of the population, and they are of sufficient severity to require treatment in 8 percent of the population. Among children with articulation disorders, 50 percent to 70 percent exhibit academic difficulties throughout the primary and secondary grades, reflecting at least in part the demonstrated relationship between early phonological disorders and later reading, writing, spelling, and mathematical achievement.

Long-term consequences can persist throughout the lifespan. Studies of adults who were diagnosed and treated for articulation disorders as children have revealed continuing difficulties in processing linguistic information, even though they seldom continued to show overt difficulties with speech sound production. These individuals are less likely to attend college and more likely to hold jobs that involve unskilled labor than their peers without a history of phonological disorder.5

Fluency disorders, also referred to as stuttering, involve an interruption in the flow of speaking manifested as an atypical rate or rhythm, repetitions of sounds, syllables, words, and phrases, or some combination of these. Secondary symptoms can include excessive tension, struggle behaviors, and odd behavioral mannerisms.3 Approximately 1 percent of the population (more than 3 million Americans) exhibits a fluency disorder that has persisted beyond 6 years of age.6 Children who stutter have a poorer educational adjustment and lower achievement than their peers who do not stutter. The disorder likely is vocationally handicapping as well, given the negative stereotypes of people who stutter and the fact that employers believe that stuttering decreases employability.6

Voice disorders are characterized by abnormal pitch, loudness, resonance, quality, or duration of voice, or by an inability to use one's voice, or by some combination of these factors. These disorders result from abnormal laryngeal, respiratory, or vocal tract functioning. They may be caused by habits of vocal misuse and hyperfunction (e.g., repeated clearing of one's throat, or prolonged talking over background noise) that produce physical changes in the vocal folds, by medical conditions (e.g., trauma, neurological disorders, allergies, or cancer), by psychological disorders (e.g., stress or personality disorders), or by a combination of these factors.7

Between 3 percent and 9 percent of the population of the United States has a voice disorder. Of the total working population in the United States, approximately 25 percent have jobs that critically require voice use.3 The majority of individuals diagnosed with voice disorders report that their voice problems have negatively affected past, current, and future job performance. Individuals in certain vocations and avocations, including teachers, singers, actors, cheerleaders, and aerobic exercise instructors, are particularly susceptible to voice disorders.7

Language Disorders

A language disorder is the impaired comprehension and/or use of spoken, written, and/or other symbol systems used for communication. Between 6 and 8 million individuals in the United States have some form of language impairment.3 Approximately 1 million of these are adults with aphasia, an acquired impairment of language comprehension and/or expression caused by brain damage, usually secondary to strokes. A large proportion of the 2 million adults with progressive dementing diseases (e.g., Alzheimer's disease, Parkinson's disease) have significant language impairments. In addition, language impairments persist among adults who failed to develop normal language skills because of developmental or acquired disorders in childhood (e.g., specific language impairment, autistic disorder, hearing impairment). Approximately 8 percent to 12 percent of preschool children have some form of language impairment.

Specific language impairment (SLI) is defined as a significant deficit in language functioning that is not accompanied by any deficits in hearing, intelligence, or motor functioning that would explain the language deficit. Developmental language disorders tend to concentrate within families,8,9 and genetic factors have been implicated.10 The overall prevalence of SLI among kindergarten students is estimated at approximately 7 percent.11

Persistence of language impairment across time is more likely among children initially diagnosed with both receptive and expressive language impairments (65 percent to 100 percent with persisting disorders, as reported across studies) than among children initially diagnosed with expressive language impairments only (0 percent to 54 percent with persisting problems, as reported across studies). Children identified with language disorders as preschoolers are at great risk for learning disabilities at school age, and the vast majority of children identified at school age as learning disabled have concomitant language disorders. One study reported a prevalence rate of 90.5 percent for language disorders among 242 children between 8 and 12 years of age who had learning disabilities.12

The rate of comorbidity of psychiatric and communication disorders in children is high, particularly for children diagnosed with attention-deficit/hyperactivity disorder (ADHD).13-15 The comorbidity of these two types of disorders is frequently unsuspected. Cohen and colleagues found that of 399 children referred for psychiatric outpatient treatment, 25 percent had language impairments that were previously unsuspected, almost equal to the number with previously identified language impairments.16 The children with previously unsuspected language impairments had the most severe externalizing behavior problems (problems with adverse effects on property or other people), compared to children with identified language disorders or those without language disorders.

Technical and Measurement Issues for Evaluation Tools

Important clinical decisions follow from the assessment of a person with a communication disorder. This assessment includes the nature of the disorder, the degree of impairment, the impact of the disorder on the individual's daily functioning and quality of life, the medical necessity of treatment, and the long-term prognosis for the individual's functional level. These clinical decisions affect an individual's access to services and funding (e.g., third-party payer coverage of treatment, eligibility for special education services, Social Security disability income). Thus, the quality of the evaluation procedures on which such decisions are based is an important issue for individuals with a communication disorder, the clinicians involved in their evaluation and treatment, and the policymakers with fiscal responsibilities for services to individuals with these disorders. This evidence report is directed to audiences who must grapple with this set of issues.

Previous reports on the quality of speech and language evaluation procedures are limited.17,18 The lack of evaluations of the quality of these procedures is understandable when the complexities and realities of speech and language disorders are considered.

Technical Issues Related to Speech and Language Disorders

First, speech and language disorders are apparent across the lifespan, but in many cases the manifestations of the disorders vary at different ages or stages of development. These changes require clinicians and others to apply different evaluation procedures appropriate to the assessment of the disorders at different developmental and chronological stages.

Second, speech and language disorders are diverse in terms of both affected functions and etiology. With respect to affected functions, for instance, broad categories include disorders of speech sound production, voice, fluency, and language, as described above. Within the broad category of language, interrelated subcomponents including semantics (language meaning and vocabulary), syntax (phrase and sentence structure), and pragmatics (appropriate use of language in context) may be differentially affected.

Finally, language comprehension and production abilities will not necessarily be commensurate in a single individual. In addition, communication disorders will differentially affect abilities in different communication and language modalities, such as gesture or sign language, spoken language, and written language.

For etiology, clinicians distinguish between developmental and acquired disorders of speech and language. Developmental communication disorders are those apparent from early on in a child's life, such that speech or language (or both) do not develop as expected. Although these developmental disorders first appear in childhood, impaired communication affects many individuals with these disorders throughout their lives.

Acquired disorders are those that affect an individual (either child or adult) who already has an intact speech and language system or who was progressing normally in the development of such a system. These disorders usually result from neurological damage following such events as strokes or closed head injury, although they also can arise from other types of disease or events (e.g., laryngeal cancer, accidents affecting oral motor structures).

Other factors contribute to the complexity of evaluating speech and language disorders as well. First, speech and language impairments often coexist with and are related to other problems, such as hearing loss, mental retardation, ADHD, autism, cerebral palsy, Parkinson's disease, dementia, hemiplegia, and many others. These coexisting conditions often require that clinicians adapt standard speech and language evaluation procedures. In some cases, speech and language evaluation instruments have been developed specifically for individuals with particular coexisting disorders; a case in point is the Rhode Island Test of Language Structure developed for children with hearing impairments by Engen and Engen.19

Another major factor, which is increasingly important with the changing demographics in the United States, is that individuals being evaluated for speech and language disorders may speak a nonstandard dialect of English, or they may not be from an English-speaking cultural environment. For example, 24.4 percent of children in kindergarten to grade 3 had limited English proficiency in 1990.20 An estimated 6.2 million culturally and linguistically diverse Americans representing different minority groups in the United States have a communication disorder. To determine the presence, nature, and severity of a speech and language disorder, these individuals must undergo an evaluation that is appropriate for both their language and their culture. However, the task of identifying valid and reliable speech and language assessment instruments and procedures for people from different linguistic and cultural environments is a challenging one for clinicians.

Measurement Issues

Assessment and diagnosis of individuals with suspected speech and language disorders is a process that involves posing a series of clinical questions and choosing the appropriate procedures and instruments to help answer those questions. Rarely, if ever, will a single assessment procedure or instrument be sufficient to ascertain a diagnosis, severity level, prognosis, and treatment recommendations for a speech or language disorder.

Posing clinical questions and choosing assessment procedures and instruments require clinical expertise and often will call for multidisciplinary assessments to clarify the nature of the speech or language disorder, its impact on functioning, and the long-term prognosis for the patient. Findings on standardized measures must be interpreted in light of other important clinical information. For example, two patients who have suffered from traumatic brain injuries may show similar performance profiles on a language measure, but their long-range prognoses could be very different depending on variables such as the elapsed time since injury, the presence or absence of a seizure disorder, or the possible impact of medication on language functioning. Ultimately at issue are the reliability and validity of the entire clinical process in yielding accurate results and sound conclusions pertaining to the possible speech or language disorder.

During the assessment process, clinicians can use a wide variety of procedures and instruments. In almost every case, the assessment will involve an interview and a written questionnaire to obtain the clinical history of the patient. The interview may follow a structured format, or it may have a dynamic format that depends on the patient's responses. Standardized written questionnaires are available to aid in the evaluation of some disorders. In some cases, the patient completes the questionnaire; in others, an informant who is knowledgeable about the patient completes it.

Many standardized instruments are available to assess disorders of speech sound production and of language. In these instruments, the individual is asked to respond to a series of questions or tasks designed to elicit particular speech or language targets or to test perception or comprehension of particular targets. The examiner scores responses against standardized criteria for accuracy or appropriateness.

Another widely used procedure is observation of the patient's communication behaviors and the physiological or neurophysiological structures that underlie those behaviors. The observation may be recorded in some way (e.g., audio- or video-recording) for later analysis. The observed behavior can be analyzed in numerous ways, depending on the nature of the suspected disorder. For instance, if a child is suspected of having a disorder of speech sound production, then a sample of the child's conversational speech might be analyzed for percentage of correct consonants or evaluated for the percentage of speech that is intelligible to an unfamiliar listener. If an individual is suspected of having a fluency disorder, his or her spontaneous speech may be rated using an observational scale such as the Iowa Scale for Rating Severity of Stuttering.21 For someone who seems to have a problem with language production, the sample might be analyzed for variables such as mean length of utterance, diversity of vocabulary, or use of a variety of linguistic structures.
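Two of the sample-based measures named above reduce to simple arithmetic once a clinician has transcribed and scored the sample. The sketch below is illustrative only: it assumes the consonants have already been judged correct or incorrect and the utterances have already been segmented into morphemes, and the function names and sample data are hypothetical.

```python
def percent_consonants_correct(n_correct, n_attempted):
    """Percentage of attempted consonants that were produced correctly."""
    if n_attempted <= 0:
        raise ValueError("n_attempted must be positive")
    return 100.0 * n_correct / n_attempted


def mean_length_of_utterance(utterances):
    """Average number of morphemes per utterance.

    `utterances` is a list of utterances, each given as a list of
    morphemes that a transcriber has already segmented.
    """
    if not utterances:
        raise ValueError("at least one utterance is required")
    return sum(len(u) for u in utterances) / len(utterances)


# Hypothetical conversational sample.
sample = [
    ["dog", "-s", "run"],
    ["I", "want", "cookie"],
    ["go"],
    ["mommy", "go", "-ing", "home"],
]

print(percent_consonants_correct(68, 85))
print(mean_length_of_utterance(sample))
```

The arithmetic is trivial; the measurement quality depends on the transcription and scoring judgments that produce the inputs, which is where inter-rater and intra-rater reliability become relevant.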

In some cases observing the structures or functions of interest is difficult, if not impossible, without the use of special instrumentation. For example, a computer-based system might be used to determine fundamental frequency, shimmer, and jitter for the voice of an individual who is being evaluated for a possible voice disorder. For the same individual, videostroboscopy might be used to observe the structure and function of the vocal folds. For an individual with a language disorder, structural or functional brain imaging (or both) may be a valuable tool for clarifying etiology, severity, and prognosis.
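To make the acoustic terms above concrete, the following minimal sketch computes mean fundamental frequency, local jitter, and local shimmer from a sequence of glottal cycle periods and peak amplitudes. It assumes one common definition of local perturbation (mean absolute difference between consecutive cycles, divided by the mean, expressed as a percentage); commercial voice analysis systems may use other variants. The function names and data are hypothetical.

```python
def mean_fundamental_frequency(periods_s):
    """Mean F0 in Hz, taken as the reciprocal of the mean cycle period."""
    return len(periods_s) / sum(periods_s)


def local_perturbation_percent(values):
    """Mean absolute cycle-to-cycle difference relative to the mean value.

    Applied to cycle periods this gives local jitter; applied to cycle
    peak amplitudes it gives local shimmer.
    """
    if len(values) < 2:
        raise ValueError("at least two cycles are required")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


# Hypothetical measurements from four consecutive glottal cycles.
periods = [0.0100, 0.0102, 0.0098, 0.0100]   # seconds
amplitudes = [1.0, 0.9, 1.1, 1.0]            # arbitrary units

print(round(mean_fundamental_frequency(periods), 1))   # F0 in Hz
print(round(local_perturbation_percent(periods), 2))   # jitter, percent
print(round(local_perturbation_percent(amplitudes), 2))  # shimmer, percent
```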

Measurement strategies and instruments will vary depending on the levels at which the disability is being evaluated. The recent terminology drafted by the World Health Organization (WHO) uses "impairment" to refer to problems in body function or structure, including psychological or linguistic functioning.22 "Activity limitations" refer to difficulties an individual may have in executing activities, and "participation restrictions" refer to problems an individual may experience in involvement in life situations. Traditionally, the focus of measurement has been the level of impairment of speech or language functions.

Contemporary practice, however, emphasizes assessment at the level of activity limitations and participation restrictions associated with speech and language impairments. Some standardized instruments have been developed to measure disability at these levels. For example, the Assessment of Intelligibility of Dysarthric Speech by Yorkston and Beukelman23 and the Functional Assessment of Communication Skills for Adults24 are measures that largely address speech or language disabilities at the level of activity limitations, whereas the Voice Handicap Index25,26 measures aspects of impairment, activity limitations, and participation restrictions.

The range of disorders, ages, and cultural or linguistic factors described above has prompted the development of a vast array of assessment instruments for speech, voice, and language. One source reviewing commercially available assessment instruments includes more than 140 tools in its most recent edition.27 Given the multiplicity of tools and the complexity of factors involved, the lack of a clear understanding of which speech and language evaluation procedures yield reliable and valid results is not surprising. Assessing individuals from English-speaking cultural environments who have no coexisting disorders is difficult; assessing individuals who have coexisting disorders or who come from non-English-speaking cultural environments is even more challenging.

Nevertheless, the quality of the speech and language evaluation procedures is of paramount importance for sound clinical decisionmaking and robust policies related to services for individuals with speech and language disorders. This report will clarify the existing knowledge base for a representative selection of speech, language, and voice assessment instruments used with children and adults, and it also will point to directions for future research.

Organization of the Report

The remainder of this evidence report is organized in the following sections. Chapter 2 provides details about the process we used to select the instruments and our literature search and review methodology. Specifically discussed are the development and modification of key questions; the process for selection of the 18 instruments reviewed; our literature review and retrieval process, including electronic searches and abstract review, data abstraction from articles, and quality control procedures; and the application of a scheme of quality rating. Chapter 3 presents the results of our analyses by population and disorder (i.e., adult language disorders, child language disorders, adult speech disorders, child speech disorders, and voice disorders). Chapter 4 reflects on the results of our review and the conclusions that can validly be drawn to answer the SSA questions. It also includes an analysis of the usability of the selected instruments. Chapter 5 offers our recommendations for a research agenda on the development and validation of instruments to evaluate speech and language disorders in adults and children. References and Evidence Tables 1-72 with supporting information and a glossary follow the main text.

The appendices provide acknowledgments (Appendix A), information on our technical expert advisory group (TEAG) (Appendix B), the peer reviewers for this report (Appendix C), and a detailed description of our methodology (Appendix D).

Chapter 2. Methodology

This chapter documents the procedures that the RTI-University of North Carolina at Chapel Hill Evidence-based Practice Center (RTI-UNC EPC) employed to develop a comprehensive evidence report for the Agency for Healthcare Research and Quality (AHRQ) and the Social Security Administration (SSA) on the criteria for determining disability in individuals with speech and/or language disorders. This report evaluates the reliability and validity of 18 instruments commonly employed by speech and language clinicians to diagnose the presence and severity of speech and language disorders. Furthermore, we present available evidence of the predictive validity for future communicative functioning for these instruments.

To establish a context for the evidence report, we first describe the key questions addressed by the report, the causal or clinical pathway that underpins these questions, and the process by which we selected the 18 diagnostic instruments for review. We then describe our literature review and retrieval process. This process included identifying appropriate electronic databases, developing appropriate search terms or key words, searching the gray literature (limited to instrument manuals), defining inclusion and exclusion criteria, and reviewing abstracts and complete articles for inclusion. Next, we describe the process by which we abstracted relevant data using one of two data extraction forms (one for peer-reviewed literature, the other for instrument manuals) and compiled relevant data into evidence tables used to summarize the information.

We also describe our process for evaluating the quality of the reviewed articles and instrument manuals and the evidence they yield. Because the contents of the instrument manuals and the articles from the peer-reviewed literature vary tremendously, we could not create a single quality rating form; thus, we outline our process for developing two forms (one for manuals, the other for peer-reviewed literature). Also discussed in this section is the process by which we assured quality control in selecting articles for review, abstracting data, creating evidence tables, and evaluating quality. The chapter closes with a discussion of the peer review process and a supplemental analysis of the usability of the selected instruments and their manuals. The Methods Appendix supplements this description of the methods we employed.

Key Questions, Causal Pathway, and Selected Instruments

We developed preliminary key questions and a causal pathway in response to the initial AHRQ/SSA request for proposal. We refined these conceptual issues and selected instruments for review during a one-day meeting (September 18, 2000, in Rockville, Maryland). Meeting participants (Appendix D) included members of our Technical Expert Advisory Group (TEAG) (described more fully in Appendix B), individuals with clinical expertise in speech, language, or voice disorders and relevant medical specialties, and representatives of professional societies and health care systems. They provided input on the utility and appropriateness of the causal pathway, refined the key questions, and identified and prioritized evaluation tools to be included in the evidence analysis.

Key Questions

Final key questions are as follows:

  1. What evaluation procedures for child and adult speech (voice, articulation/intelligibility, fluency) and language disorders have been demonstrated to have the salient characteristics of a good diagnostic tool (e.g., reliability, validity, appropriate normative data, responsiveness) for individuals who are/have:

    • English-speaking, have normal hearing, with or without normal cognition?

    • Non-English-speaking, have normal hearing, with or without normal cognition?

    • Mentally retarded?

    • Learning disorders?

    • Hearing impaired?

  2. Are there evaluation procedures that have been demonstrated to have predictive validity for the individual's communicative impairment, performance, or both?

Causal Pathway

   Figure 1. Causal Pathway for Determining Disability in Speech and Language Disorders

Figure 1, our final causal pathway (i.e., analytical framework or conceptual model), depicts the scope of our evidence report. It begins with referral of an individual with a suspected disorder for evaluation. The subsequent evaluation of the individual yields a diagnosis, a determination of the severity of the disability or impairment, and an estimation of prognosis (prediction of duration of impairment and the expected level of functional communicative skills that will ultimately be gained or regained). These, in turn, serve as input to the determination of disability according to SSA guidelines.

Our evidence report begins with the speech-language evaluation and stops short of the SSA determination of disability per se (which may involve consideration of other impairments coexisting with the speech and/or language impairment). We focus on speech and language evaluation procedures and prediction of outcomes based on these procedures. In particular, we have evaluated the following:

  • Reliability (internal consistency, test-retest, inter-rater, and intra-rater);

  • Validity (content, construct, concurrent [convergent and divergent/discriminant], and predictive for future communicative functioning);

  • Clinical test parameters (sensitivity, specificity, positive predictive value, negative predictive value), where available;

  • Responsiveness; and

  • Availability of normative data for the evaluation instruments.
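The clinical test parameters listed above all derive from a two-by-two comparison of an instrument's result against a reference standard. The following minimal sketch shows the standard formulas; the function name and the counts are hypothetical and do not come from any study reviewed here.

```python
def clinical_test_parameters(tp, fp, fn, tn):
    """Compute sensitivity, specificity, PPV, and NPV from 2x2 counts.

    tp: disorder present, test positive    fp: disorder absent, test positive
    fn: disorder present, test negative    tn: disorder absent, test negative
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
    }


# Hypothetical validation sample of 200 individuals.
results = clinical_test_parameters(tp=40, fp=10, fn=10, tn=140)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are properties of the instrument against the reference standard, whereas the predictive values also depend on how common the disorder is in the sample, so they do not transfer directly between populations with different prevalence.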

In the causal pathway, an individual with suspected speech or language disorder is evaluated using selected evaluation procedures. Clinicians today use many different evaluation procedures to evaluate speech and language disorders in adults and children. Decisions regarding which procedure(s) to employ are now left to the discretion of the health care professional(s) conducting the evaluation.

The choice of evaluation instrument(s) will be determined by five sets of factors:

  1. Whether the individual is a child or an adult;

  2. The type of disorder (speech sound production, voice, stuttering, or language -- including consideration of semantics, syntax, pragmatics, comprehension, production, and language modality);

  3. The presence of other impairments (e.g., cognitive impairments, hearing loss, physical disability);

  4. Whether the individual is from a non-English-speaking cultural environment; and

  5. Contextual factors, including examiner training and expertise, and setting for evaluation and/or treatment.

Characteristics of these instruments are critical to a successful disability determination process. If the clinical decisions based on their use are to be sound and objective, then evaluation instruments must be reliable and valid for the purposes for which they are employed. Such purposes include making a diagnosis, determining degree of impairment, and/or predicting short-term and long-term impairment and/or functional communicative outcomes.
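Internal consistency, one of the reliability properties considered in this report, is often summarized with Cronbach's alpha. The sketch below applies the usual formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), to a small matrix of item scores. The function name and the data are hypothetical, and alpha is offered here as one common index rather than as the statistic any particular manual reports.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of examinee rows of k item scores."""
    k = len(item_scores[0])
    if k < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two examinees")
    columns = list(zip(*item_scores))          # one tuple per item
    sum_item_variances = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical scores: five examinees, three items each.
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
    [3, 3, 4],
]

print(round(cronbach_alpha(scores), 3))
```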

Clinicians and technical experts synthesize information gained from the various procedures and instruments employed in the speech and language evaluation. In this process, they also may incorporate information from other evaluations not directed at the speech and language disorder per se. On the basis of all this information, they will arrive at conclusions regarding diagnosis, severity level, prognosis, and treatment recommendations.

At the final step in the causal pathway, conclusions from the speech-language evaluation are considered in light of the SSA rules for determination of disability (cited in Chapter 1). For all individuals applying for disability benefits under Title II, and for adults applying under Title XVI, disability is defined as the inability to engage in any substantial gainful activity by reason of any medically determinable physical or mental impairment(s) that can be expected to result in death or that has lasted or can be expected to last for a continuous period of not less than 12 months. Children under the age of 18 applying under Title XVI are considered disabled if they have a medically determinable physical or mental impairment that causes marked and severe functional limitations and that can be expected to cause death or that has lasted or can be expected to last for a continuous period of not less than 12 months. An individual may be considered disabled on the basis of severe speech-language impairment alone, or the individual's speech-language impairment may be considered in combination with other impairments in determining disability status.

Selection of Instruments

Because of the diversity of speech and language disorders being considered, many evaluation tools/procedures provisionally fall within the scope of the key questions. We know of no single source that enumerates all speech and language evaluation instruments. The 140 commercially available speech and language instruments represent only a subset of all the evaluation tools available.27 The task of synthesizing the available evidence on all speech and language evaluation procedures was clearly too large an undertaking to complete within the scope of this project. Thus, a critical step in this evidence review was to select and then set priorities for instruments in such a way as to address the important informational needs of the SSA while also limiting the scope to fall within the contractual boundaries of the project.

During the September 18, 2000 meeting, we asked participants to select instruments and to set priorities for this review. We provided meeting participants with a list of speech-language diagnostic instruments as a reference during the selection process. We reminded participants that, because evidence-based medicine considers evidence from the peer-reviewed literature to be of paramount value or importance, they should suggest only instruments for which this type of evidence would be available. Thus, instruments that might otherwise be considered standards in the field (e.g., the American Speech-Language-Hearing Association [ASHA] Functional Assessment of Communication Skills24) and for which reliability and validity data are not published in the peer-reviewed literature would be excluded from further consideration.

Using this criterion, participants nominated 39 tools across five categories: adult language, child language, adult speech, child speech, and voice disorders. Because we could not conduct systematic literature reviews and evidence analyses for all 39 nominated tools given the project timeline and resources, we asked meeting participants to set priorities for the tools within the five categories, selecting three tools in each. Meeting participants set priorities based on the following guiding principles: (1) tools must assess disorders broadly rather than addressing a single aspect of the disorder, (2) for children, instruments must be useful with a broad age range, and (3) there should be a balance between instruments based on elicited behaviors and those based on observed behaviors.

Table 2. Instruments Selected for Review, by Disorder Category

Adult Language

  • Boston Diagnostic Aphasia Examination, 2nd Edition (BDAE-2)

  • Porch Index of Communicative Ability (PICA)

  • Western Aphasia Battery (WAB)

Child Language

  • Clinical Evaluation of Language Fundamentals, 3rd Edition (CELF-3)

  • CELF-3 Spanish Edition (CELF-3Sp)

  • CELF-3 Preschool (CELF-P)

  • Test of Language Development-Primary, 3rd Edition (TOLD-P:3)

  • Test of Language Development-Intermediate, 3rd Edition (TOLD-I:3)

  • Preschool Language Scale-3 (PLS-3)

  • Preschool Language Scale-3 Spanish Edition (PLS-3Sp)

  • Test of Pragmatic Language (TOPL)

Adult Speech

  • Assessment of Intelligibility in Dysarthric Speech (AIDS)

  • Dysarthria Examination Battery (DEB)

  • Stuttering Severity Instrument for Children and Adults, 3rd Edition (SSI-3)

Child Speech

  • Goldman-Fristoe Test of Articulation, 2nd Edition (GFTA-2)

  • Stuttering Severity Instrument for Children and Adults, 3rd Edition (SSI-3)

  • Phonological process analysis (PPA)

Voice

  • GRBAS (grade, rough, breathy, asthenic, strain) Scale

  • Multi-Dimensional Voice Profile (MDVP)

  • Voice Handicap Index (VHI)

From these 39 instruments, participants selected a total of 20 instruments for literature review and evidence analysis (Table 2) -- three each for adult language, adult speech, child speech, and voice, and eight for child language disorders. One instrument selected covers both adult and child speech disorders and thus is counted twice. After consultation with TEAG members in December 2000 and with colleagues in the Division of Speech and Hearing Sciences at UNC, we excluded phonological process analysis because it is not a single instrument. Rather, it is an approach to conducting a more comprehensive clinical analysis of phonological patterns for which standard "diagnostic test characteristics" would be difficult to determine.

Literature Search

This section describes the literature search procedures, specifying inclusion/exclusion criteria utilized to select literature for review, search terms and strategies, electronic databases employed, and our limited gray literature search. We document the steps used to identify the articles and instrument manuals that we ultimately reviewed.

Inclusion and Exclusion Criteria

Table 3. Criteria for Determining Disability in Speech and Language Disorders: Study Inclusion/Exclusion Criteria

Category: Criteria

Study Population: Humans, Children, Adolescents, and Adults (ages 18 to 62)

Evaluation Tools:

  • Adult Language: Boston Diagnostic Aphasia Examination, 2nd Edition; Porch Index of Communicative Ability; Western Aphasia Battery

  • Child Language: Clinical Evaluation of Language Fundamentals, 3rd Edition (CELF-3); CELF-3 Spanish Edition; CELF-3 Preschool; Test of Language Development-Primary, 3rd Edition; Test of Language Development-Intermediate, 3rd Edition; Preschool Language Scale, 3rd Edition (English); Preschool Language Scale, 3rd Edition (Spanish); Test of Pragmatic Language

  • Adult Speech: Assessment of Intelligibility; Dysarthria Examination Battery; Stuttering Severity Instrument for Children and Adults, 3rd Edition

  • Child Speech: Goldman-Fristoe Test of Articulation, 2nd Edition; Stuttering Severity Instrument for Children and Adults, 3rd Edition; Phonological Process Analysis

  • Voice: GRBAS (Grade, Rough, Breathy, Asthenic, Strain) Scale; Multidimensional Voice Profile; Voice Handicap Index

Study Setting: Inpatient and outpatient settings, communities, schools

Outcomes Measured:

  • Reliability (internal consistency, test-retest, inter- and intra-rater)

  • Validity (content, construct, concurrent [convergent and divergent/discriminant])

  • Predictive validity (for future communicative functioning), and normative data for particular populations (children or adults ages 18-62)

  • Cognitive impairments (normal, borderline normal, mentally retarded, learning disabled)

  • Hearing impairments, language, or cultural issues (English- versus non-English speaking)

  • Exclude studies evaluating the efficacy or effectiveness of interventions/treatment of speech and/or language disorders

Time Period: Depends on when the evaluation tool(s) were developed and validated

Geographic Site of Study: Exclude based on language of publication. Normative data must be from the United States

Publication Languages: English only

Admissible Evidence (Study Design and Other Criteria):

  • Randomized controlled trial (RCT) -- double, single-blinded, and cross-over

  • Non-RCT -- nonequivalent control group designs, cohort studies (prospective and retrospective), case-control studies, psychometric evaluations

  • Other designs (meta-analysis, meta-regression, cross-design synthesis, review article for reference list searches)

  • Sample size: >20 subjects per analysis group

Inclusion/exclusion criteria (Table 3) were developed for and revised slightly during the September 18, 2000, meeting. Essentially, we included all research on the selected instruments in children and adults (ages 18 through 62) in which the study evaluated the instrument's reliability, validity, or its ability to predict future communicative impairment and/or functioning (i.e., predictive validity).

We included studies of individuals older than 62 years if the majority of the study sample was age 62 and younger. We excluded studies that: (1) concerned solely elderly adults (i.e., individuals > 62 years of age); (2) were published in languages other than English; and (3) did not report information related to the key questions (except to the extent that they are used to provide background information).

We excluded articles reporting the efficacy or effectiveness of speech or language therapy that did not provide information relevant to the key questions. We also excluded articles providing normative data from populations other than the United States because of our need to address issues facing the SSA in establishing disability criteria in the United States.

Search Terms and Databases

Table D3. Search Terms Employed in the Literature Review

MeSH Search Terms and Key Words (Exploded)

  • Study Designs: study design, study characteristics, randomized controlled trial [publication type], single-blind method, double-blind method, random allocation, cross-over, case-control studies, retrospective cohort, longitudinal studies, outcomes

  • Disorders: Language development disorders, language disorders, speech disorders, child language disorders, adult language disorders, voice disorders, articulation disorders, dysarthria, stuttering, aphasia, apraxias, developmental apraxia

  • Tests (a): Phonological Process Analysis, Clinical Evaluation of Language Fundamentals, Preschool Language Scale, Goldman-Fristoe, Western Aphasia Battery, Test of Pragmatic Language, Test of Language Development, Stuttering Severity Index, Assessment of Intelligibility for Dysarthric Speech, Dysarthria Examination Battery, Boston Diagnostic Aphasia Examination, Voice Handicap Index, Multidimensional Voice Profile, GRBAS, Porch Index of Communicative Ability

  • Primary Outcomes: Predictive value of tests, sensitivity and specificity, reproducibility of results, reliability

  • Other: Quality of life, activities of daily living, functional status, outcomes and process assessment (health care), outcomes assessment (health care), costs and cost analysis, cost-benefit analysis, epidemiologic study characteristics

(a) Keyword searches were used for test names to achieve the broadest search possible. As a result, we often found nonrelevant citations because the search engine looks for words in proximity.

The EPC's information specialist, in consultation with EPC staff, developed search terms based on the names of the selected instruments. Preliminary searches were conducted using broad Medical Subject Headings (MeSH) relating to study design, disorders, and outcomes in combination with the names of the instruments (Table D3 in Appendix D). These searches yielded enormous numbers of citations; very few were relevant. Consequently, we revised our searches to use the names of the instruments as key words, sometimes searching on only the most important term (e.g., Goldman-Fristoe). This strategy, while producing large numbers of citations, was most likely to ensure that we did not miss possibly relevant citations.

We employed a multifaceted approach to identify relevant studies, including use of standard electronic (literature) databases, reference lists of relevant articles, and Cochrane Collaboration resources. We searched electronic data sources, including MEDLINE, CINAHL, PsycLIT®, ERIC, Health and Psychosocial Instruments (HAPI), and the Cochrane Collaboration databases. The PsycLIT® database, produced by the American Psychological Association, covers the professional and academic literature in psychology and related disciplines including medicine, psychiatry, nursing, sociology, education, pharmacology, physiology, linguistics, and other areas. ERIC, produced by the Educational Resources Information Center, indexes and abstracts journal and report literature (1966 to the present) in education and related disciplines. HAPI provides information on measurement instruments (i.e., questionnaires, interview schedules, checklists, index measures, coding schemes/manuals, rating scales, projective techniques, vignettes/scenarios, tests) in the health fields, psychosocial sciences, organizational behavior, and library and information science.

We also reviewed the reference lists of all included articles and instrument manuals to identify additional studies. Finally, we conducted searches of the Cochrane Collaboration Database, a family of electronic databases providing reference information on randomized and other controlled clinical trials and systematic reviews conducted by various Cochrane Collaborative Review Groups. These studies were reviewed for quality, and in some cases we obtained additional information from the original authors or through hand-searches of the literature.

Gray Literature Search

We initially excluded all gray literature for two reasons: (1) the large number of selected instruments and the substantial literature for each of these tools, and (2) the time and resource constraints of this project. An additional issue was the danger that including gray literature studies would tend to identify studies with "memorable findings," which are more likely to be either very positive or very negative relative to studies that were not as well remembered. Incomplete sampling of the gray literature would introduce the possibility of bias in study selection.

However, review of abstracts for eligibility (see "Literature Retrieval" below) made it clear that, for many instruments, data on reliability and validity could be found only in the instrument manuals. In most cases, instrument developers had not published their validation or standardization results. Thus, we expanded our efforts to include instrument manuals in the evidence review process. For several instruments, we identified some reliability and validity data in doctoral dissertations (via a search of Dissertation Abstracts International). Unless the dissertations were later published either as instrument manuals or as peer-reviewed articles, we excluded them from the evidence analysis because of their relative inaccessibility or obscurity.

Literature Retrieval

Either the study director or the scientific director reviewed all abstracts for a single test, using the Abstract Review Form (Figure D1 in Appendix D). If the first reviewer judged an abstract to meet inclusion criteria, we included it in our review. The second reviewer independently reviewed all abstracts excluded by the first reviewer. If both reviewers excluded an abstract, we dropped the abstract from further consideration. If the reviewers disagreed, they met to reconcile differences between their reviews. The scientific director's decision stood when differences remained after reconciliation. Generally speaking, for all abstracts, we erred on the side of inclusion rather than exclusion, sending questionable abstracts to the other reviewer for further consideration.
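The screening rule described above can be sketched as a small decision function. This is only an illustrative sketch in Python; the function name and its parameters are hypothetical and are not part of the Abstract Review Form or of any software used in the project.

```python
from typing import Optional


def screen_abstract(first_includes: bool,
                    second_includes: Optional[bool] = None,
                    reconciled: Optional[bool] = None,
                    director_includes: Optional[bool] = None) -> bool:
    """Return True if the abstract moves forward, False if it is dropped.

    All parameter names are hypothetical labels for the steps in the text.
    """
    # An inclusion by the first reviewer is sufficient on its own.
    if first_includes:
        return True
    # Every abstract excluded by the first reviewer gets a second, independent look.
    if second_includes is None:
        raise ValueError("an excluded abstract needs a second review")
    # Both reviewers exclude: the abstract is dropped.
    if not second_includes:
        return False
    # The reviewers disagree: they meet to reconcile.
    if reconciled is not None:
        return reconciled
    # Differences remain after reconciliation: the scientific director decides.
    if director_includes is None:
        raise ValueError("unreconciled disagreement needs the director's decision")
    return director_includes


print(screen_abstract(True))                            # included by first reviewer
print(screen_abstract(False, False))                    # dropped by both reviewers
print(screen_abstract(False, True, reconciled=True))    # included after reconciliation
```

Because a single inclusion vote is enough to carry an abstract forward, the rule is asymmetric in favor of inclusion, which matches the stated preference for erring on the side of inclusion.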

Initially, we planned to re-review a randomly selected 20-percent subset of abstracts marked for inclusion. However, because the number of articles included was quite small, the clinical experts re-reviewed all the abstracts selected for inclusion.

Results of Article Selection

Table 4. Results of Literature Searches, Abstract Review, and Data Abstraction Processes, by Instrument

Adult Language Instruments
Steps in Literature Review Process | BDAE-2 | PICA | WAB
Total unduplicated records | 211 | 80 | 152
Sent for Abstract Review | 211 | 80 | 152
Include | 3 | 9 | 10
Exclude | 208 | 71 | 142
Sent for Data Abstraction | 3 | 9 | 10
Excluded at Data Abstraction | 1 | 5 | 4
Sent for Evidence Table Creation | 2 | 4 | 6
Peer-reviewed articles | 2 | 4 | 6
Manuals | 1 | 1 | 1
Evidence Tables Created | 3 | 5 | 7

Child Language Instruments
Steps in Literature Review Process | CELF-3/CELF-3Sp/CELF-P | TOLD-I:3/TOLD-P:3/TOLD-P:2/TOLD-I:2 | TOPL | PLS-3/PLS-3Sp/PLS-R
Total unduplicated records | 72 | 256 | 10 | 112
Sent for Abstract Review | 72 | 256 | 10 | 112
Include | 9 | 17 | 0 | 6
Exclude | 63 | 239 | 10 | 106
Sent for Data Abstraction | 9 | 17 | 0 | 6
Excluded at Data Abstraction | 4 | 15 | 0 | 2
Sent for Evidence Table Creation | 5 | 2 | 0 | 4
Peer-reviewed articles | 5 | 2 | 0 | 4
Manuals | 3 | 2 | 1 | 2
Evidence Tables Created | 8 | 4 | 1 | 6

Adult Speech Instruments
Steps in Literature Review Process | AIDS | DEB | SSI-3 (a)
Total unduplicated records | 21 | 3 | 36
Sent for Abstract Review | 21 | 3 | 36
Include | 1 | 0 | 1
Exclude | 20 | 3 | 35
Sent for Data Abstraction | 1 | 0 | 1
Excluded at Data Abstraction | 1 | 0 | 1
Sent for Evidence Table Creation | 0 | 0 | 0
Peer-reviewed articles | 0 | 0 | 0
Manuals | 1 | 1 | 1
Evidence Tables Created | 1 | 1 | 1

Child Speech Instruments
Steps in Literature Review Process | GFTA-2/GFTA | PPA | SSI-3 (a)
Total unduplicated records | 137 | 84 | 36
Sent for Abstract Review | 137 | 84 | 36
Include | 9 | 12 | 1
Exclude | 128 | 72 | 35
Sent for Data Abstraction | 9 | 12 | 1
Excluded at Data Abstraction | 5 | --- (b) | 1
Sent for Evidence Table Creation | 4 | 0 | 0
Peer-reviewed articles | 4 | 0 | 0
Manuals | 1 | 0 | 1
Evidence Tables Created | 5 | 0 | 1

Voice Instruments
Steps in Literature Review Process | GRBAS | MDVP | VHI
Total unduplicated records | 12 | 13 | 3
Sent for Abstract Review | 12 | 13 | 3
Include | 8 | 3 | 3
Exclude | 4 | 10 | 0
Sent for Data Abstraction | 8 | 3 | 3
Excluded at Data Abstraction | 1 | 0 | 1
Sent for Evidence Table Creation | 7 | 3 | 2
Peer-reviewed articles | 7 | 3 | 2
Manuals | 0 | 1 | 0
Evidence Tables Created | 7 | 4 | 2

(a) Stuttering Severity Instrument for Children and Adults, 3rd Edition, selected for both adults and children.

(b) Phonological Process Analysis excluded at data abstraction.

AIDS = Assessment of Intelligibility of Dysarthric Speech; BDAE-2 = Boston Diagnostic Aphasia Examination, 2nd Edition; CELF-3 = Clinical Evaluation of Language Fundamentals, 3rd Edition; CELF-3Sp = CELF-3 Spanish Edition; CELF-P = CELF-3 Preschool; DEB = Dysarthria Examination Battery; GFTA-2 = Goldman-Fristoe Test of Articulation, 2nd Edition; GRBAS = GRBAS (Grade, Rough, Breathy, Asthenic, Strain) Scale; MDVP = Kay Elemetrics Multidimensional Voice Profile; PICA = Porch Index of Communicative Ability; PPA = Phonological process analysis; PLS-3 = Preschool Language Scale, 3rd Edition; PLS-3Sp = Preschool Language Scale, 3rd Edition (Spanish); SSI-3 = Stuttering Severity Instrument for Children and Adults, 3rd Edition; TOLD-P:3 = Test of Language Development-Primary, 3rd Edition; TOLD-I:3 = Test of Language Development-Intermediate, 3rd Edition; TOLD-P:2 = Test of Language Development-Primary, 2nd Edition; TOLD-I:2 = Test of Language Development-Intermediate, 2nd Edition; TOPL = Test of Pragmatic Language; VHI = Voice Handicap Index; WAB = Western Aphasia Battery

We combined the search results from the five databases for each instrument. This yielded 1,238 citations summing across all parts of Table 4. The number of citations sent for abstract review ranged from three (for the Dysarthria Examination Battery [DEB] and Voice Handicap Index [VHI]) to 256 for the Test of Language Development (TOLD). Of these citations, the reviewers judged that 92 met study inclusion criteria.

Of the articles that we excluded at the abstract review phase, we eliminated a substantial number because they were citations of non-peer-reviewed publications, dissertations, test reviews/critiques, manuals, or materials from the Mental Measurements Yearbook Series published by the Buros Institute in Nebraska. We excluded more than 75 citations because they did not describe one of the selected tests.

Of the 92 articles whose abstracts met inclusion criteria, we eliminated 53 at the data abstraction phase. Of these, 12 addressed the phonological process analysis. We dropped the remaining 41 because they did not meet the study inclusion criteria or did not address the version of the instrument selected by TEAG members. In cases where no peer-reviewed articles existed for the selected version, we included articles, but not manuals, from the version just prior to the one we had originally selected.
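The counts quoted in the preceding paragraphs can be cross-checked against Table 4 with a few lines of arithmetic. The sketch below, in Python, is illustrative only. The dictionary layout and the short column labels are our own, and the Phonological Process Analysis entry for exclusions at data abstraction (shown as a footnoted dash in Table 4) is assumed to be 12, following the text above.

```python
# Per-instrument counts read from Table 4:
# (unduplicated records, included at abstract review, excluded at data abstraction).
# The SSI-3 appears twice because Table 4 counts it under adult and child speech.
# PPA's exclusion count is an assumption taken from the text (12 articles).
TABLE4 = {
    "BDAE-2": (211, 3, 1), "PICA": (80, 9, 5), "WAB": (152, 10, 4),
    "CELF-3": (72, 9, 4), "TOLD": (256, 17, 15), "TOPL": (10, 0, 0),
    "PLS-3": (112, 6, 2),
    "AIDS": (21, 1, 1), "DEB": (3, 0, 0), "SSI-3 (adult)": (36, 1, 1),
    "GFTA-2": (137, 9, 5), "PPA": (84, 12, 12), "SSI-3 (child)": (36, 1, 1),
    "GRBAS": (12, 8, 1), "MDVP": (13, 3, 0), "VHI": (3, 3, 1),
}

total_records = sum(row[0] for row in TABLE4.values())
total_included = sum(row[1] for row in TABLE4.values())
total_excluded = sum(row[2] for row in TABLE4.values())
remaining = total_included - total_excluded

print(total_records)    # citations sent for abstract review
print(total_included)   # abstracts meeting inclusion criteria
print(total_excluded)   # articles eliminated at data abstraction
print(remaining)        # peer-reviewed articles sent for evidence table creation
```

Under these assumptions the sums reproduce the 1,238 citations, the 92 included abstracts, and the 53 articles eliminated at data abstraction reported in the text, leaving 39 peer-reviewed articles for evidence table creation.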

Data Abstraction Process

The data abstraction process included developing data extraction forms, training abstractors, and instituting a quality control process.

Data Extraction Forms

The study director worked with a psychometrician to develop the two data extraction forms needed to review the peer-reviewed literature and the instrument manuals. Because the selected instruments are for both adults and children and encompass language, speech, and voice disorders, we could not develop a single extraction form that would gather all necessary data. Rather than create a form for each combination of population and disorder (i.e., one each for adult speech, child speech, adult language, child language, voice) for a total of five forms or one for each instrument (20 forms), we opted to create two modular extraction forms that we could use with all instruments and for each population and disorder combination. The modular approach allowed abstractors to use only the portions that were relevant for a specific article or manual.

The forms collect information on study design, population studied, examiner qualifications and training, results (validity, reliability, predictive validity, and normative data), and limitations. The forms also provide essential instructions to abstractors and a description of designs used in the reliability testing and validation of instruments, the appropriate statistics, and benchmarks for evaluating whether an instrument had good reliability and validity. We tested our extraction forms to be sure that the instructions were clear and that abstractors would find the forms easy to use; we made changes as needed. We present the final forms in Figures D2 and D3 in Appendix D.

Abstractors and Training

Two quantitative psychology doctoral students conducted all data abstraction because we were unable to find abstractors who had both appropriate methodological skills and speech and language clinical expertise. Both abstractors had course work in quantitative methods and in both classical and modern test theory (i.e., item response theory), and both had experience in the validation and standardization of educational tests. Each article or instrument was reviewed by one abstractor only. We did not blind the abstractor to journal, authors, or institutions.

Both abstractors received training from the study director and a psychometrician. This training included a question-by-question review of the data extraction form and practice abstraction of a manual and an article. The study director and psychometrician reviewed the extraction forms, correcting any errors in the forms and making certain that the abstractors understood the task. The study director monitored progress of the abstraction process and provided feedback to the abstractors, as did the psychometrician.

Quality Control Process

Although evidence-based medicine "best practices" require that two abstractors independently review each article, we were unable to conduct dual abstraction because we could not find enough reviewers with appropriate skills. Instead, the study director, a health services researcher with expertise in quantitative methods and systematic review, reviewed all data extraction forms and evidence tables. EPC clinical experts re-reviewed evidence tables in their areas of expertise to ensure that all information was reproduced accurately. Because the abstractors and EPC staff did not complete data extraction forms independently, we could not check the inter-rater reliability of the data abstraction process, as previous reports have done.

Quality and Strength of Evidence Evaluation

Rating the Quality of Individual Articles and Manuals

Quality rating items previously used by the RTI-UNC EPC are tailored to randomized and nonrandomized clinical trials, designs seldom used in psychometric evaluation; furthermore, these items do not address reliability and validity testing. Thus, we sought guidance from the educational and psychological testing literature for criteria and standards to use in our quality evaluation. The American Educational Research Association (AERA), American Psychological Association, and the National Council on Measurement in Education regularly publish standards for test construction, evaluation, and documentation. The most recent version, 1999 Standards for Educational and Psychological Testing, hereafter known as the 1999 Standards, forms the basis for our quality rating scales.28 Our use of the 1999 Standards is not precedent-setting. In 1984, McCauley and Swisher had adapted an earlier edition to evaluate the psychometric properties of preschool language and articulation instruments.29

Figures D4 and D5 in Appendix D present the quality rating items and scoring. The individual article quality rating form comprises 35 criteria (for a maximum score of 35 points) in seven main categories, covering research design and conduct; the measurement of reliability, validity, and development of instrument norms; justifications for conclusions; and external validity concerns. The manual rating form includes six additional criteria addressing aspects of instrument development or revision; the maximum score for this form is 41 points. We rescaled the quality score, reporting it as a percentage. When an item on a form did not apply to a particular article or manual, we subtracted the points for that item from both the numerator and the denominator when calculating the percentage score.
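The rescaling can be illustrated with a short calculation. The sketch below, in Python, assumes one point per criterion (consistent with 35 criteria yielding a maximum of 35 points) and uses None to mark an item that does not apply; the function name and the example scores are hypothetical.

```python
def quality_percent(item_scores):
    """Percentage quality score over applicable items only.

    item_scores: one entry per criterion, 1 (met), 0 (not met),
    or None (not applicable). The None convention is an assumption
    made for this sketch.
    """
    applicable = [score for score in item_scores if score is not None]
    if not applicable:
        raise ValueError("no applicable items to score")
    # Non-applicable items are left out of both numerator and denominator.
    return 100.0 * sum(applicable) / len(applicable)


# Hypothetical article: 35 criteria, 5 not applicable, 24 of the remaining 30 met.
scores = [1] * 24 + [0] * 6 + [None] * 5
print(quality_percent(scores))  # 80.0
```

Dropping non-applicable items from the denominator keeps an article from being penalized for criteria that could not have been met by its design.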

The EPC study director and clinical experts completed a quality form for each manual and article. The clinical experts evaluated instruments in their areas of expertise; the study director evaluated all manuals and articles from a methodological standpoint. We report the quality rating scores separately for the clinical and methodological experts rather than averaging the scores across the quality reviewers. We did not use quality scores for inclusion/exclusion decisions.

Evaluating the Psychometric Properties of the Instruments

Even when we examined the psychometric literature, we found it particularly difficult to identify thresholds for reliability and validity. Although the 1999 Standards28 include reliability, validity, and normative data criteria, they do not venture to set thresholds for them. The Scientific Advisory Committee (SAC) of the Medical Outcomes Trust cited some standards for reliability coefficients,30 which generally match those found in seminal educational and psychological testing texts,31-33 but the SAC did not provide them for validity. Anastasi31 and Cronbach33 provide limited information for setting validity thresholds; we used what they presented in creating our criteria and thresholds. For normative data, we used the 1999 Standards28 and the criteria employed by McCauley and Swisher.29

The criteria selected are those usually used to make individual comparisons or, in this case, decisions about individuals. It is important to minimize the possibility that a patient is misdiagnosed -- that is, to make certain that patients with true impairments are not missed and that "normal" individuals are not labeled as having an impairment. This is particularly important in the context of disability determination, where incorrect decisions can result in individuals being wrongly kept off or struck from SSA disability eligibility rolls or, conversely, in healthy individuals inappropriately being covered within the disability program.

From these various sources, we established the following criteria:

  • Reliability -- the criterion for reliability is "strictly" met if the following three conditions are all met:

    • Internal consistency reliability, measured using either Cronbach's coefficient alpha or Kuder-Richardson statistics (K-R 20), is greater than or equal to 0.90;

    • Test-retest/intra-rater reliability is greater than or equal to 0.90 if measured using a correlation coefficient, or greater than or equal to 0.80 if measured using Cohen's Kappa; and

    • Inter-rater reliability is greater than or equal to 0.90 if measured using a correlation coefficient, or greater than 0.80 if measured using Cohen's Kappa.

Some might reasonably argue that the criterion for internal consistency reliability is set too high given the complexity of speech and language functioning and disorders. Additionally the resultant variability in daily performance suggests that our criterion for test-retest reliability or intra-rater reliability also may be too high. Thus, we defined a "relaxed" criterion, which differs from the strict criterion in that internal consistency reliability may be as low as 0.80 and/or test-retest/intra-rater reliability may be as low as 0.80 (correlations) or 0.70 (Cohen's Kappa). The relaxed criterion is at a level suitable for having confidence in group, rather than individual, comparisons.

  • Validity -- the criterion for validity is met if the following conditions are all met:

    • Instrument developers examine relationships between subtests, composite scores, and total scores, establishing hypotheses a priori for these relationships and for patterns of scores for individuals belonging to various groups of import;

    • These relationships all are statistically significant at p < 0.05; and

    • In the case of correlation coefficients, the magnitude of the relationship is at least 0.30, thus providing evidence of a moderate correlation.

  • Normative Data -- the criterion for normative data is met if the following conditions are all met:

    • Data are available for the population targeted by the instrument;

    • An adequate sample size is used (i.e., at least 100 per group); and

    • Evidence is provided on how well the sample represents the population.
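The three sets of criteria above can be expressed as simple threshold checks. The following Python sketch is illustrative only: the function and parameter names are hypothetical, the thresholds are copied from the criteria as worded above (including the strict "greater than 0.80" wording for inter-rater kappa), and taking the absolute value of a correlation for the validity check is an assumption.

```python
def _meets(value, is_kappa, r_min, kappa_min, kappa_exclusive=False):
    """Compare a coefficient with the correlation or kappa threshold."""
    if is_kappa:
        return value > kappa_min if kappa_exclusive else value >= kappa_min
    return value >= r_min


def reliability_level(internal, retest, inter,
                      retest_is_kappa=False, inter_is_kappa=False):
    """Return 'strict', 'relaxed', or 'not met' for the reliability criterion."""
    # The inter-rater threshold is the same under the strict and relaxed criteria.
    if not _meets(inter, inter_is_kappa, 0.90, 0.80, kappa_exclusive=True):
        return "not met"
    if internal >= 0.90 and _meets(retest, retest_is_kappa, 0.90, 0.80):
        return "strict"
    # Relaxed: internal consistency and test-retest/intra-rater may be lower.
    if internal >= 0.80 and _meets(retest, retest_is_kappa, 0.80, 0.70):
        return "relaxed"
    return "not met"


def validity_met(hypotheses_a_priori, p_values, correlations):
    """A priori hypotheses, all p < 0.05, and all correlations of at least 0.30."""
    return (hypotheses_a_priori
            and all(p < 0.05 for p in p_values)
            and all(abs(r) >= 0.30 for r in correlations))


def norms_met(targets_population, group_sizes, representativeness_shown):
    """Target population covered, at least 100 per group, representativeness shown."""
    return (targets_population
            and len(group_sizes) > 0
            and all(n >= 100 for n in group_sizes)
            and representativeness_shown)


print(reliability_level(0.92, 0.91, 0.93))   # strict
print(reliability_level(0.85, 0.82, 0.93))   # relaxed
print(validity_met(True, [0.01, 0.03], [0.45, 0.35]))
print(norms_met(True, [120, 150], True))
```

Keeping the three checks separate mirrors the way the criteria are reported, so an instrument can meet one criterion (for example, normative data) while failing another.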

Grading the Strength of Available Evidence

Various systems for rating the strength of evidence have been developed for clinical topics and pathways.34 To date no consensus exists about which system, if any, is best or most appropriate. Evidence grading schemes previously used by the RTI-UNC EPC and the US Preventive Services Task Force (1996), although recognized for their utility for the projects for which they had been developed, provided little guidance for this particular topic.35

The type of literature that EPCs normally review, and upon which existing evidence grading scales are based, is substantially different from the literature available to address issues of diagnosing speech and language disorders and predicting the level and duration of disability from such conditions. The existing schemes are used with peer-reviewed literature and would not typically be applicable to the instrument manuals on which the majority of this report is based.

We have separated instrument manuals from peer-reviewed literature, emphasizing here that only two manuals were published in the peer-reviewed literature. Today's methods for conducting systematic reviews would dictate that we downgrade the manuals for not having been peer reviewed. However, several manuals documented that the development teams employed rigorous psychometric methods in the instrument development and validation process, and we wanted to reflect those efforts in our grading scheme. Also important was to be able to comment on the quality of the individual instruments, which would not have been possible if we graded using a system that assigned an "unacceptable" grade to all the manuals for reason of their not being peer-reviewed.

To provide some consistent basis for indicating the strength of the overall body of evidence for groups of instruments identified by age group and disorder, we set out the following definitions for both instrument manuals and peer-reviewed literature.

  • Acceptable: research or analyses were well conducted, had representative samples of reasonable size, and met our psychometric evaluation criteria discussed earlier.

  • Unacceptable: studies were poorly conducted, used small or nonrepresentative samples, or had results that did not meet or only partially met the psychometric criteria.

Development of Evidence Tables

To balance providing enough information to allow the reader to judge the quality of the literature against the volume and complexity of the literature, we developed a series of five evidence tables for each instrument. These tables provide what we considered the essential information to address the key questions. The first of these tables gives information on the study design and conduct and the quality scores assigned by the methodologist and the expert clinicians. The subsequent four tables describe the reliability, validity, predictive validity for future communicative functioning, and available normative data found in the reviewed articles and manuals.

The format and basic content of the evidence tables are similar across the instruments. When articles or instrument manuals presented multiple studies, we documented details of the different designs, samples, and results in the evidence tables. Where we reviewed different editions or versions (e.g., Spanish vs. English) of an instrument, we grouped data together by edition or version in the evidence tables. For example, we present all literature pertaining to the Test of Language Development:3 (TOLD:3) together. If we found no data for a particular outcome (e.g., reliability, validity, predictive validity, or normative data), we did not create an evidence table.

To conserve space in the evidence tables, we used numerous abbreviations. Although most of these abbreviations are self-explanatory, we provide a glossary of abbreviations and commonly used terms to assist the reader in reviewing the evidence tables.

Supplemental Analysis -- Usability Analysis

When deciding which instrument to use, a clinician must evaluate whether the manual provides sufficient information on how to administer and score the instrument. As part of our analyses, we evaluated the usability of the instrument manuals (additional details are provided in the Methods Appendix). Two speech and language pathology graduate students independently evaluated manuals using the Usability Evaluation Form (Figure D6 in Appendix D). In Chapter 3, we report the results of the analysis.

Table D4: Percentage Agreement and Inter-rater Reliability (Kappa) (Appendix D: Methodology)

Criteria | % Agreement | Kappa
1. Instrument administration procedures can be duplicated | 94.1 | --- (a)
2. Scoring procedures can be duplicated | 82.4 | --- (a)
3. Examiner qualifications specified | 81.2 | 0.60
4. Required examiner training documented | 94.1 | 0.87
5. Environmental and equipment requirements described | 100 | 1
6. Raw score scale meaning and interpretation described | 76.5 | --- (a)
7. Derived score scale meaning and interpretation described | 82.4 | 0.34
8. Scale construction described | 88.2 | 0.45

(a) SAS could not calculate kappa because of missing data in the cells. When we replaced the missing values with 0.001, kappa values were 0.002, 0.0004, and 0.0002 for criteria 1, 2, and 6, respectively.

To assess inter-rater reliability, we computed Cohen's Kappa statistic36 and percentage agreement between the raters for each individual criterion (Table D4 in Appendix D). Kappa values for the individual criteria ranged from 0.34 to 1.00, suggesting poor to almost perfect agreement.37 Inter-rater agreement ranged from 76.5% (13/17) to 100% (17/17). The reviewers agreed most often on administration procedures, examiner training, and equipment and environmental needs, and least often on interpretation of raw scale scores.
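The two agreement statistics named above can be illustrated with a minimal Python sketch. This is not the SAS analysis used for the report; the function names and the 17 ratings are hypothetical, chosen only so that the raters agree on 13 of 17 manuals. The sketch returns None when chance agreement equals 1 (both raters used a single category), which is one situation in which kappa is undefined; it does not attempt to reproduce the SAS behavior described in the table footnote.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Percentage of items on which two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e); None when it is undefined."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: proportion of items rated identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        return None  # both raters used one category only; kappa is undefined
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical ratings of 17 manuals on one criterion (1 = met, 0 = not met).
rater_1 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
rater_2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

print(round(percent_agreement(rater_1, rater_2), 1))  # 13 of 17 agree
print(round(cohen_kappa(rater_1, rater_2), 2))
```

In this made-up example the raters agree on 76.5 percent of the manuals, yet kappa is only about 0.48, which shows why percentage agreement and kappa can tell different stories.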

Peer Review Process

A group of 18 clinicians, methodologists, representatives of professional societies, and potential users of the report, including TEAG members, were sent the draft evidence report. Of these 18, 10 returned a review. These peer reviewers provided comments on the content, structure, and format of the evidence report, paying particular attention to the inclusion/exclusion of literature for the selected instruments, to the analysis and interpretation of study results and evidence, and to the discussion of gaps and areas that should be targeted for future research. Appendix C describes the selection process for peer reviewers and lists the names of all peer reviewers.

Methods Appendix: Explanation of Reliability and Validity

Reliability and validity are important properties of instruments, especially those used to assess speech and language disorders for disability determination. Every instrument must yield consistent measurement (i.e., it must be reliable) and must measure what it is intended to measure (i.e., it must be valid).31-33,38 Thus, an instrument measuring language disorders in children must consistently measure expressive and receptive language each time the instrument is administered. The questions that measure expressive language ability must indeed measure aspects of expressive language and not measure receptive language abilities.

Both reliability and validity are required of an instrument, but reliability takes precedence over validity. Specifically, an instrument that is valid but is not reliable (i.e., it does not produce consistent measurement) is not useful. Moreover, reliability and validity are not absolute concepts. Thus, it is correct to say that an instrument is reliable and valid for use with a particular population; it is not correct to say broadly that an instrument is reliable and valid. If it is important that an instrument be reliable and valid for use with a particular population (e.g., children who are mentally retarded or hard of hearing), then the instrument developers must test the instrument with these populations and report reliability and validity data separately for them.

This chapter appendix describes the forms of reliability and validity considered in this evidence report, documenting the types of statistics typically used to measure reliability and validity and the thresholds typically used to assess "acceptability."

Reliability

Reliability of measurement refers to the extent to which the observed variation in scores is due to variation in the "true" score (i.e., the "true" value of an underlying construct such as expressive language or articulation) rather than random error. Thus, the scores on an instrument such as the TOLD-P:3 should not change unless the underlying expressive or receptive language ability changes. That an instrument is reliable is a necessary condition for use; an instrument that does not produce consistent results has little utility. When instrument developers measure reliability, they generally consider several types: internal consistency (or inter-item consistency), test-retest, intra-rater, and inter-rater (or inter-observer) reliability.

Internal Consistency Reliability

Internal consistency (inter-item consistency) reliability measures how well the individual items (or questions) within a scale or scales relate to composite scores (e.g., the individual questions that make up the various TOLD-P:3 subtests or the relationship of various subtests to the composite scores). Typically, internal consistency is reported as coefficient alpha33 or as Kuder-Richardson formula 20 (K-R 20)39 if the instrument items are scored on a 5-point scale (e.g., 5 = never, 4 = occasionally) or dichotomously (e.g., correct-incorrect), respectively. For scales or items measured on a continuous scale, instrument developers employ correlation coefficients (both the K-R 20 and coefficient alpha are correlation coefficients).
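As an illustration of the two statistics just named, the following Python sketch computes coefficient alpha and K-R 20 from a respondents-by-items score matrix. The function names and the small data set are hypothetical and are not drawn from any instrument reviewed here. The sketch uses population variances throughout; under that assumption the two statistics coincide for dichotomous items, because the variance of a 0/1 item is p(1 - p).

```python
def _variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def coefficient_alpha(scores):
    """Coefficient alpha; scores is a list of respondents, each a list of item scores."""
    k = len(scores[0])
    total_var = _variance([sum(row) for row in scores])
    if k < 2 or total_var == 0:
        raise ValueError("need at least 2 items and some variation in total scores")
    item_vars = [_variance([row[i] for row in scores]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def kr20(scores):
    """Kuder-Richardson formula 20 for items scored 0 (incorrect) or 1 (correct)."""
    k, n = len(scores[0]), len(scores)
    total_var = _variance([sum(row) for row in scores])
    if k < 2 or total_var == 0:
        raise ValueError("need at least 2 items and some variation in total scores")
    # p = proportion answering an item correctly; p * (1 - p) is the item variance.
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_var)


# Hypothetical data: 5 respondents, 4 dichotomous items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2))               # about 0.80
print(round(coefficient_alpha(responses), 2))  # same value for 0/1 items
```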

Some experts consider coefficients of 0.90 or greater acceptable for individual comparisons like those made in the evaluation of speech and language disorders; others suggest that 0.80 is an appropriate threshold for coefficient alpha.37,38,40,41 However, setting the threshold for coefficient alpha at 0.80 may be too strict for shorter scales because alpha increases as the number of items in the scale increases.38
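The dependence of alpha on scale length can be sketched with the standardized form of coefficient alpha, alpha = k * r / (1 + (k - 1) * r), where k is the number of items and r is the average inter-item correlation. The value r = 0.30 below is a hypothetical illustration, not a figure from any reviewed instrument.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized coefficient alpha for n_items with a given average inter-item correlation."""
    if n_items < 2:
        raise ValueError("alpha requires at least 2 items")
    k, r = n_items, mean_inter_item_r
    return (k * r) / (1 + (k - 1) * r)


# Hypothetical: the same average inter-item correlation, scales of different length.
for k in (5, 10, 20, 30):
    print(k, round(standardized_alpha(k, 0.30), 2))
```

With the item quality held constant in this way, the 5-item scale falls below a 0.80 threshold while the 10-item scale exceeds it, which is the concern raised above about applying one threshold to scales of different lengths.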

Test-retest or Intra-rater Reliability

Test-retest reliability measures the consistency of scores obtained at two separate times. Intra-rater reliability measures whether, upon repeated administration, the examiner or instrument assigns the same scores to an individual. Typically, instrument developers report both as Cohen's Kappa42 for dichotomous scales or as a correlation coefficient for continuous scales. Landis and Koch suggest that kappa values of less than 0.4 indicate poor agreement, 0.4 to 0.6 indicate moderate agreement, 0.61 to 0.8 suggest substantial agreement, and greater than 0.8 suggest almost perfect agreement.40
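The kappa bands described above can be written as a small lookup, sketched here in Python. The function name is hypothetical, and the handling of values between 0.60 and 0.61 (assigned to "substantial") is an assumption, because the bands as stated leave that interval open.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the agreement bands described in the text."""
    if kappa < 0.4:
        return "poor"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:  # assumption: values just above 0.60 count as substantial
        return "substantial"
    return "almost perfect"


# Kappa values reported in Table D4 for criteria 3, 4, 5, 7, and 8.
for value in (0.60, 0.87, 1.00, 0.34, 0.45):
    print(value, interpret_kappa(value))
```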

Inter-rater Reliability

Inter-rater reliability measures whether two observers or examiners score an individual, in this case an adult or child with a speech or language disorder, in the same way. Typically, instrument developers report inter-rater reliability as Cohen's Kappa42 for dichotomous scales or as a correlation coefficient for continuous scales. Thresholds similar to those described for test-retest and intra-rater reliability are used with inter-rater reliability.

Validity

The validity of an instrument refers to "how well it measures what it purports to measure"(p. 83).32 When a test developer validates or examines the validity of an instrument, she or he broadly examines the relationships between an individual's performance on the instrument and particular observable facts about the behaviors studied. In the case of a patient's expressive language abilities, the instrument items that are designed to measure expressive language should indeed measure the individual's ability. The procedures for evaluating validity have been given a variety of names over the years. We employ the names given in the 1999 Standards -- content-related, construct-related, and criterion-related validation.28

As we described earlier for reliability, validity should not be couched in terms of whether an instrument is "valid or not." Rather, the correct way to think about it is whether the instrument is valid for a particular purpose, population, and situation.

Content-related Validity

When evaluating content validity, test developers typically examine whether the items in the instrument adequately cover and represent the dimension(s) to be measured. As such, the evaluation of content validity is a subjective assessment, often conducted by experts, of how appropriate the items included in the instrument are. Typically, content validity assessment involves an organized review to make certain that relevant items are included and that inappropriate items are not.

Construct-related Validity

Construct validity is defined as "the extent to which the test may be said to measure a theoretical construct or trait"(p. 153).31 Construct validity involves the gradual accumulation of information from a variety of sources and studies, where test developers look for correspondence between the theory and available data.31,33,38

Convergent and divergent (or discriminant) validation are important aspects of construct validation activities.31,33,38 With convergent validity, test developers are interested in whether a particular instrument correlates highly with variables or characteristics that it should correlate highly with. For example, an instrument measuring articulation should correlate highly with other instruments that measure articulation. Divergent (or discriminant) validity is present when an instrument does not correlate significantly with variables from which it should differ. Typically, instrument developers report construct validity in the form of correlation coefficients between items on the instrument (or composite measures) representing a particular theoretical construct and other characteristics of the instrument. Factor analysis is often employed to identify underlying behavioral traits. In practice, instrument developers use factor analysis to identify items or subscales that are related to each other and to particular traits of interest (e.g., verbal comprehension as measured by subscales addressing vocabulary and sentence completion skills).31
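A minimal Python sketch of a convergent and discriminant check follows. All scores are invented for illustration: a hypothetical new articulation instrument is correlated with a hypothetical established articulation instrument (where a high correlation is expected) and with a hypothetical unrelated measure (where a low correlation is expected). No particular cutoffs are implied by the sketch.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2 scores)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for the same 8 individuals on three measures.
new_articulation = [10, 12, 15, 18, 20, 23, 25, 28]
established_articulation = [11, 13, 14, 19, 21, 22, 26, 27]
unrelated_measure = [5, 9, 4, 8, 6, 9, 5, 7]

convergent_r = pearson_r(new_articulation, established_articulation)
discriminant_r = pearson_r(new_articulation, unrelated_measure)
print(round(convergent_r, 2), round(discriminant_r, 2))
```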

Criterion-related Validity

Criterion-related validity is a measure of how effective an instrument is in predicting an individual's performance in specific activities, such as performance on another similar instrument or future behaviors. In evaluating criterion validity, performance on an instrument "is checked against a criterion, that is, a direct and independent measure of that which the test is designed to predict"(p. 145).31 In the case of evaluation for speech, language, or voice disorders, the criterion might be performance on another instrument (concurrent validity) or future communicative function or performance (predictive validity).

Concurrent Validity

With concurrent validity, test developers are interested in the individual's performance on two instruments given simultaneously or after a short interval.28,31 The main purpose of concurrent validity testing is to develop an instrument as a substitute for a more time- and/or resource-intensive assessment procedure.31 Concurrent validity is measured as a correlation between the instrument and the criterion instrument; high correlation indicates concurrent validity. For concurrent validity testing, a critical factor is that the criterion instrument be a "gold standard," that is, an instrument that is relevant and well-known with demonstrated psychometric properties.31 Both concurrent and predictive validity are typically presented as correlation coefficients between the instrument and the criterion instrument or behavior; a high correlation is taken as evidence of validity.

Predictive Validity

With predictive validity, test developers are interested in whether the individual's performance on a particular instrument can be used to predict future events or performance. Predictive validity differs from concurrent validity primarily in that the interval between the administration of the instrument of interest and the later measurement of behavior or performance is much longer. As with concurrent validity, instrument developers report predictive validity in terms of correlation coefficients.

Chapter 3. Results

This chapter documents our findings concerning the properties of 18 instruments to diagnose or assess speech and language disorders among adults and children. As explained in Chapter 2, these instruments were those given high priority by an expert panel early in this project and confirmed as instruments of interest to the Social Security Administration (SSA) with respect to its responsibilities for determining disability eligibility. We present the evidence on instruments ordered by a combination of age and disorder, as listed in Table 2 in Chapter 2: adult language disorders, child language disorders, adult speech disorders, child speech disorders, and voice disorders.

For each instrument, we first present a profile of relevant characteristics and other information of likely interest to potential users. This includes author and publisher information, target age groups and populations, estimates of administration time, the availability and type of normative data provided, components of the instrument package if acquired or purchased, a brief description of the administration and scoring procedures, and a listing of earlier versions of the instrument. These profiles (called out in parentheses for each instrument) can be found in Tables 5 through 22 at the end of this chapter. Following the instrument-specific profile, we document information pertaining to each of our two key questions.

Briefly, Key Question No. 1 relates to evidence about basic psychometric properties of these diagnostic tools and instruments. We thus present the literature addressing different types of reliability: internal consistency reliability, test-retest or intra-rater reliability, and inter-rater reliability, in that order. Next, we describe the available evidence for construct, concurrent, and content validity, again in that order. Finally, we describe the types of normative data and the populations to whom the normative data apply and assess the generalizability of the normative data (i.e., whether they were derived from a population including normal individuals or a population of only individuals with a speech-language disorder).

Key Question No. 2 pertains more narrowly to predictive validity -- i.e., the ability of these instruments to predict or forecast future functioning of patients (or school performance, in the case of children) diagnosed with a speech, language, or voice impairment. (Strictly speaking the issue is prediction of future impairment in age-appropriate daily activities.) In addition, because the SSA must concern itself with a wide range of patient populations, we conclude each instrument-specific section by examining the applicability of available evidence to these target populations (persons who do not speak English or who are cognitively impaired, mentally retarded, or hard of hearing, or who have learning disorders).

Table 24. Key Clinical Question 1: Evidence for Reliability, Validity, and Normative Data, by Instrument (a)

Column groups: Validity = Construct, Concurrent; Normative Data = Present, Rep (d); Applicability to (c) = N-Eng, CI, LD, SLD, MR, HI.

Instrument | Reliability (b) | Construct | Concurrent | Present | Rep (d) | N-Eng | CI | LD | SLD | MR | HI
Adult Language
BDAE-2 |  | X |  | X |  |  |  |  |  |  |
PICA | Relaxed | X |  |  |  |  |  |  |  |  |
WAB | Strict (original only) | X (original only) | X | X |  |  |  |  |  |  |
Child Language
CELF-3 | Relaxed (total scores only) | X (composite scores only for all but 1 subtest) | X | X | X |  |  |  |  |  |
CELF-P |  | X | X | X | X |  |  |  |  |  |
CELF-3Sp | Relaxed | X | X | X | ? (e) | X |  |  |  |  |
PLS-3 | Relaxed (except 0-8 months) | X | X | X | X |  |  |  |  |  |
PLS-3Sp |  |  |  |  |  | X |  |  |  |  |
TOLD-P:3 | Relaxed | X | X | X | X |  |  | X | X | X | X
TOLD-I:3 | Relaxed | X | X | X | X |  |  | X | X | X | X
TOPL |  | X | X | X | X |  |  |  |  |  |
Adult Speech
AIDS |  |  |  | X |  |  |  |  |  |  |
DEB |  |  |  |  |  |  |  |  |  |  |
SSI-3 |  | X | X | X |  |  |  |  |  |  |
Child Speech
GFTA-2 |  |  | X | X | X |  |  |  |  |  |
SSI-3 |  | X | X | X |  |  |  |  |  |  |
Voice
GRBAS |  | X |  |  |  |  |  |  |  |  |
MDVP | Strict | N/A | X | X |  |  |  |  |  |  |
VHI | Relaxed | X | X | N/A |  |  |  |  |  |  |

(a) Blank cells indicate criteria were not met or no data were found.

(b) Strict: meets combined criteria of internal consistency reliability (ICR): alpha >0.90, test-retest reliability (T-RR): kappa >0.80 or correlation >0.90, and inter-rater reliability (I-RR): kappa >0.80 or correlation >0.90; Relaxed: meets combined criteria of ICR: alpha >0.80, T-RR: kappa >0.70 or correlation >0.80, and I-RR: kappa or correlation >0.90.

(c) N-Eng = Non-English speakers, CI = cognitively impaired, SLD = speech-language disordered, LD = learning disabled, MR = mentally retarded, HI = hearing impaired.

(d) Rep = Representative of U.S. population.

(e) ? = Could not determine from information given. For instrument names, see Table 4.
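The "strict" and "relaxed" reliability criteria in footnote b can be expressed as a simple decision rule, sketched below in Python. The function names and the input format are hypothetical. The sketch assumes that one summary coefficient per reliability type is supplied; the footnote does not say how ranges across subtests were reduced to a single judgment, so that step is left to the caller. The thresholds are copied from footnote b as printed.

```python
def _exceeds(value, kind, kappa_cutoff, correlation_cutoff):
    """True if a coefficient exceeds the cutoff that applies to its kind."""
    if value is None:
        return False
    cutoff = kappa_cutoff if kind == "kappa" else correlation_cutoff
    return value > cutoff


def reliability_grade(alpha, test_retest, inter_rater):
    """Return "Strict", "Relaxed", or "" (blank cell) using the footnote b thresholds.

    test_retest and inter_rater are (value, kind) pairs, with kind either
    "kappa" or "correlation"; pass (None, "kappa") when no data exist.
    """
    if alpha is None:
        return ""
    strict = (
        alpha > 0.90
        and _exceeds(*test_retest, 0.80, 0.90)
        and _exceeds(*inter_rater, 0.80, 0.90)
    )
    if strict:
        return "Strict"
    relaxed = (
        alpha > 0.80
        and _exceeds(*test_retest, 0.70, 0.80)
        and _exceeds(*inter_rater, 0.90, 0.90)  # footnote b: kappa or correlation >0.90
    )
    return "Relaxed" if relaxed else ""


# Hypothetical coefficients for three made-up instruments.
print(reliability_grade(0.95, (0.85, "kappa"), (0.92, "correlation")))
print(reliability_grade(0.85, (0.82, "correlation"), (0.93, "correlation")))
print(repr(reliability_grade(0.85, (0.82, "correlation"), (0.85, "kappa"))))
```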

To keep the text of this chapter manageable, we have elected not to comment in the narrative when no evidence is available on a given element of a key question for specific instruments. For example, if we identified no literature (either peer-reviewed or gray) on an evaluation criterion such as predictive validity (Key Question No. 2) or on internal consistency reliability (within Key Question No. 1), then those headings and subheadings will not appear for that particular instrument in this chapter. In short, this chapter documents what evidence is available; Table 24 (Chapter 4) and the discussion in Chapter 4 document the gaps in instrument-specific information and the deficiencies in the overall body of evidence.

Evidence Tables 1 through 72 systematically organize available information on each instrument, again following the order of instruments in this chapter. (Instrument-specific evidence tables are cited in parentheses.) Generally, each instrument has from three to five evidence tables. The first evidence table in the set gives study designs and other information about the empirical studies that we reviewed; succeeding evidence tables provide information about the outcomes of studies with respect to reliability, validity (construct, concurrent, and predictive), and the availability of normative data. Evidence tables also give the quality rating scores assigned to each article (according to the methods explained in Chapter 2).

This chapter concludes with a brief discussion of our supplemental analysis of the usability of the instruments. We defined usability to be how feasible and practical it is to use the instruments in everyday settings. Chapter 4 continues our examination of these results; Chapter 5 draws implications for future research.

Boston Diagnostic Aphasia Examination, 2nd Edition

Table 5. Boston Diagnostic Aphasia Examination, 2nd Edition (BDAE)
Author: H. Goodglass and E. Kaplan
Publisher: Williams & Wilkins, Rose Tree Corporate Center, Building II, 1400 North Providence Road, Suite 5025, Media, PA 19063-2043; 1-800-638-0672
Date of Publication: 1983
Ages: Adults
Administration Time: Not specified
Scores: Percentile scores for all subtests, including severity rating, fluency, auditory comprehension, naming, oral reading, repetition, paraphasia, automatic speech, reading comprehension, writing, music, and spatial and computational
Normative Data Provided: Performance on all subtests is compared to a sample of 242 men with aphasia tested at the Boston Veterans Administration Medical Center between 1976 and 1982.
Components: H. Goodglass and E. Kaplan, The Assessment of Aphasia and Related Disorders, 2nd Edition; Boston Diagnostic Aphasia Examination Booklet; 16 test stimulus cards; Boston Naming Test; Boston Naming Test scoring booklet.
Procedures: The examiner first engages the patient in conversation, incorporating suggested questions from the BDAE. The examiner also asks the patient to tell about what is happening in a stimulus picture. Following the conversational and expository speech sample, the examiner completes an aphasia severity rating scale and a rating scale profile of speech characteristics. Then the examiner proceeds through the BDAE subtests. The subtest order can be varied. Subtests involve a variety of tasks, including asking the patient to point to pictures named by the examiner, point to body parts, follow a series of commands, respond to yes/no questions, rapidly repeat mouth movements, rapidly repeat words, recite memorized materials, imitate words and phrases, read words, respond to questions requiring names, name pictures, discriminate printed letters and words, identify spelled words, complete written sentences by choosing the correct word, and execute some writing tasks. The examiner assigns points for each response based on the directions in the manual. Total raw scores are converted to a percentile score on the subtest summary profile. The overall profile for a patient across all subtests is used to identify aphasia type, based on guidelines in the manual.
Other: The BDAE includes supplementary language tests in both comprehension and expression, as well as The Boston Naming Test as an extended test of vocabulary naming. In addition, it includes supplementary nonlanguage tests of spatial and quantitative performance and of apraxia. The BDAE, 3rd Edition, was published in 2001, too late for review for this report.
Earlier Versions: The Boston Diagnostic Aphasia Examination111

The Boston Diagnostic Aphasia Examination, 2nd Edition (BDAE-2) is an instrument for the neuropsycholinguistic evaluation of adults with aphasia for one of three purposes: (1) diagnosis of presence and type of aphasia syndrome, leading to inferences regarding site of lesion; (2) measurement of the level of language performance over a wide range of abilities; and (3) comprehensive assessment of the assets and liabilities of the patient in all language areas as a guide to therapy (Table 5 and Evidence Tables 1-4).43 We did not identify any evidence for this instrument on test-retest or intra-rater reliability, concurrent validity, or predictive validity (Key Question No. 2).

Key Question No. 1

Goodglass and Kaplan standardized the BDAE-2 using 242 male patients treated at the Boston Veterans Administration (VA) Medical Center between 1976 and 1982.43 No additional information was provided about patient demographics. They made no claim that the standardization sample is representative of individuals with aphasia in the United States; thus, we ought to take care not to generalize the results beyond this population.

Reliability

Only the BDAE-2 instrument manual provided reliability data.43 These data were limited to internal consistency and inter-rater reliability (Evidence Table 2).

Internal consistency reliability

Goodglass and Kaplan reported reliability statistics (Kuder-Richardson statistics [K-R 20]) ranging from 0.68 to 0.90. Only 14 of the 21 subtest scores met our "strict" criterion for internal consistency reliability (i.e., they were at least 0.90).43 These data were derived from a small (n = 34) sample of men with varying types of aphasia; no additional information was provided to judge the external validity or generalizability of the results.

Inter-rater reliability

Goodglass and Kaplan provided inter-rater reliability data only for The Profile of Speech Characteristics, which assesses aphasia severity.43 Inter-rater reliability, reported as correlations, ranged from 0.78 to 0.90. Correlations at the lower end may be attributed to the highly subjective nature of the rated behaviors. Conversely, the instrument developers, who were highly familiar with instrument items and each other's ratings, were the raters in this evaluation; thus, the inter-rater reliability coefficients may have been slightly higher than might otherwise have been expected.

Validity

Only construct validity data were available for both the original version44 and the BDAE-2 (Evidence Table 3).43

Construct validity

Goodglass and Kaplan reported correlations between overall severity rating and BDAE-2 subtests but provided no a priori hypotheses about the relationships.43 Correlations between overall severity and the subtests ranged widely from −0.24 to 0.79. Intercorrelations of subtests ranged from −0.24 to 0.93.

Additional evidence suggests that the BDAE-2 and the original BDAE may not classify all individuals with aphasia as aphasic. Goodglass and Kaplan employed discriminant analysis with a small, highly selective sample of patients with Broca's, Wernicke's, conduction, and anomic aphasia.43 The discriminant analysis classified all but one patient into the original classification. Crary et al., using cluster analysis to evaluate how well the original BDAE classified aphasia patients, reported that only 38 percent of the cases matched the original classification.44 We are uncertain whether these results would be replicated with larger and more representative groups of adults with aphasia.

Content validity

Goodglass and Kaplan described the variety of deficits that occur in persons with aphasia.43 After explaining each area of deficit, they explained why and how the BDAE-2 addresses each deficit.

Available Normative Data

Goodglass and Kaplan presented a set of norms derived from a sample of 242 male patients treated in the Boston VA Medical Center (Evidence Table 4).43 They also evaluated the performance of neurologically normal men, developing a somewhat arbitrary cutoff score (i.e., lowest scoring normal) for aphasia. Most normal subjects scored the maximum number possible; elderly and less well-educated men were most likely not to achieve the maximum score. We are uncertain whether these norms can be generalized or used with typical aphasia patients, because they were derived from individuals for whom we know little more than sex and who were from a single institution.

Applicability of BDAE-2 to Target Populations

Goodglass and Kaplan provided no guidance on the instrument's use with any population of special interest to the SSA.43 Rosselli et al. provided thresholds, by educational attainment, for a Spanish translation of the original BDAE.45 However, because they derived these norms from Spanish speakers in a single region in Colombia, we may not be able to generalize them to all Spanish speakers or to Spanish-speaking aphasia patients in the United States.

Porch Index of Communicative Ability

Table 6. Porch Index of Communicative Ability (PICA)
Author: B.E. Porch
Publisher: Pro-Ed, 8700 Shoal Creek Boulevard, Austin, TX 78757-6897; 512-451-3246; http://www.proedinc.com/
Date of Publication: 1981
Ages: Adults
Administration Time: Approximately 1 hour
Scores: Subtest means, representing average rated complexity level of response to items within subtest (range of 1-16); overall score, representing mean response level across all subtests; mean modality response level (writing, copying, reading, pantomime, verbal, auditory, visual, gestural, and graphic modalities); high and low percentile ranks for subtest scores, modality scores, and overall score.
Normative Data Provided: Percentile ranks based on two population samples: 357 adults with left hemisphere damage and 100 adults with bilateral damage
Components: Theory and development manual; administration, scoring and interpretation manual; test format booklet; test objects; stimulus cards; graphic test sheets; score sheet.
Procedures: Examiner and patient are seated at a table. Examiner administers all 18 subtests according to instructions summarized in test format booklet. Responses elicited include verbal naming, description, pantomime and actual demonstration of object use, receptive identification of objects, reading, writing, and drawing. The examiner scores the patient's responses to each item on a scale of 1 to 16, based on criteria described in the administration, scoring, and interpretation manual.
Other: Manual indicates that approximately 40 hours of training are required to assure reliable administration and scoring of the PICA.
Earlier Versions: The volume of the PICA manual for administration, scoring and interpretation of the test is in its third edition (1981). The volume regarding theory and development was published in 1967.46

The Porch Index of Communicative Ability (PICA) was designed to evaluate the communicative abilities of individuals with aphasia, including aspects of their verbal, gestural, and graphic communication (Table 6 and Evidence Tables 5-7).46 The primary goal was to develop an instrument that would reliably and sensitively quantify a patient's communicative ability so that clinicians and researchers can assess the effect of variables such as treatment, time, surgery, and drugs on communication. We found no evidence for this instrument on concurrent validity or normative data.

Key Question No. 1

Porch standardized the PICA using 150 adults with a diagnosis of brain injury or referral for investigation of a communication disability (Evidence Tables 5-7).46 He described the age, race, educational attainment, and time since onset of aphasia. No information was provided with which to judge whether this sample was representative of aphasic adults in the United States.

Reliability

Only the instrument manual provided evidence of reliability (Evidence Table 6);46 data were available for internal consistency, test-retest, and inter-rater reliability. These results were based on relatively small subsets (n = 30 to 40) of the standardization sample; no information was given on whether the sample was representative of aphasic adults in the United States.

Internal consistency reliability

Porch reported Spearman-Brown reliability coefficients and correlation coefficients based on 30 individuals from the standardization sample. Spearman-Brown coefficients ranged from 0.82 to 0.99 across judges and subtests, with the lowest coefficients observed for the gestural subtests.46 Correlation coefficients ranged from 0.88 to 0.99. Collectively, these coefficients met our "relaxed" criterion for internal consistency reliability.

Test-retest or intra-rater reliability

Correlations for the 18 subtests ranged from 0.70 to 0.99. Correlations for the different communication modalities were as follows: gestural, 0.96; verbal, 0.99; and graphic, 0.96. The overall correlation was 0.98.46 The overall score met our intra-rater/test-retest criterion; several of the subtest scores did not.

Inter-rater reliability

Inter-rater correlations were greater than 0.93 for the subtests and greater than 0.97 for response modalities.46 Thus, they met our inter-rater reliability criterion. Porch reported significant variance in three subtest scores, the overall test, and gestural responses, but he attributed it to a single scorer.46

Validity

Construct validity data were available in the instrument manual46 and in one peer-reviewed article (Evidence Table 7).47 We identified no concurrent validity data.

Construct validity

Porch reported correlations between modality and subtest scores and age and educational attainment.46 Correlations between age and overall scores and gestural and graphic modality scores were statistically significant but lower than the 0.30 threshold (r = −0.17 to −0.18). Educational level was significantly correlated with the graphic modality score and with nine subtests, but the magnitude of correlations generally fell below 0.30. Porch also reported that correlations of subtests within communication modalities were higher than those between modalities; all correlations were greater than 0.30, suggesting evidence of construct validity for his population of individuals with communication disorders attributed to brain injury or another cause.46 Clark et al., using principal components analysis of data from 148 brain-injured adults, provided additional evidence suggesting construct validity for use with individuals whose aphasia can be attributed to brain injury.47

Content validity

Porch discussed theory relating to aphasia and various PICA subtests.46

Key Question No. 2

Two studies provide limited and contradictory evidence of the ability of the PICA to predict future impairment as measured by the PICA at 6 months (Evidence Table 8).48,49

Lendrem and Lincoln, using data from 52 mildly to moderately aphasic adults, found that PICA verbal, gestural, and graphic components accounted for 69 percent of the total variance in 34-week scores, suggesting that early PICA scores (at 4 weeks) could be used to predict impairment at 6 months.48 In a related study, Lincoln and McGuirk compared two methods (i.e., slope of improvement and a statistical t-test) for predicting impairment.49 Their sample differed from that of Lendrem and Lincoln in that 68 of the 124 adults received speech-language treatment. Neither prediction method fared well. The proportions of patients with accurate predictions (i.e., predicted within plus or minus 10 percent of actual score) were low for both treated and untreated patients, suggesting that PICA scores poorly predict recovery. Both studies employed relatively small samples and provided little information to allow readers to assess whether the results could be replicated with a different group of aphasic adults.
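The accuracy measure described above (the proportion of patients whose predicted score fell within plus or minus 10 percent of the actual score) can be sketched as follows. This is an illustration only: it assumes the tolerance is taken relative to the actual score, which the report does not specify, and the scores shown are hypothetical rather than data from either study.

```python
def proportion_within_tolerance(predicted, actual, tolerance=0.10):
    """Share of cases whose prediction lies within `tolerance`
    (as a fraction of the actual score) of the actual score."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    hits = sum(
        1 for p, a in zip(predicted, actual)
        if abs(p - a) <= tolerance * abs(a)
    )
    return hits / len(actual)


# Hypothetical predicted and observed follow-up scores.
predicted_scores = [10.0, 12.0, 8.0, 15.0]
actual_scores = [10.5, 10.0, 8.5, 15.0]
print(proportion_within_tolerance(predicted_scores, actual_scores))
```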

Western Aphasia Battery, 2nd Edition (WAB)

Table 7. Western Aphasia Battery (WAB)
Author: A. Kertesz
Publisher: The Psychological Corporation, 555 Academic Court, San Antonio, TX 78204; Voice 1-800-872-1726; Fax 1-800-232-1223; http://www.psychcorp.com/
Date of Publication: 1982
Ages: Adults
Administration Time: Approximately 1 hour for oral language portion (yielding Aphasia Quotient)
Scores: Subtest scores for spontaneous speech, comprehension, repetition, and naming are combined to yield Aphasia Quotient; subtest scores for reading and writing, praxis, and construction are combined to yield Cortical Quotient. Patient may be classified according to aphasia type based on oral language subtest scores using taxonomic table in manual.
Normative Data Provided: None
Components: Test manual; objects; stimulus cards; blocks (for block design task); test booklet.
Procedures: Examiner and patient are seated at a table (or patient may be tested at bedside). All items are administered to each patient. Examiner first engages patient in conversation and then scores informational content and fluency of spontaneous speech according to scale provided in test booklet. Other test items used to calculate Aphasia Quotient include responding to questions; identifying objects, body parts, pictures, letters, and numbers; following directions; imitating words; and naming objects. Test items used to calculate Cortical Quotient include reading sentences, following written directions, oral spelling, writing a story, writing to dictation, copying written words, gesturing, drawing, completing mathematical calculations, block design, and completing the Raven's Colored Progressive Matrices. Examiner scores responses according to instructions provided in the test booklet.
Other: Audiotaping of the test session is recommended.
Earlier Versions: 1977 version.51
The Western Aphasia Battery (WAB) (Table 7 and Evidence Tables 9 to 11) was designed to identify aphasia syndromes and their severity through the evaluation of clinical aspects of oral language functioning as well as reading, writing, calculation ability, and nonverbal skills.50,51

Key Question No. 1

Shewan and Kertesz standardized the original WAB in Ontario, Canada, using a sample of 150 aphasic adults and 59 control subjects.51 The subjects included individuals with communication disorders resulting from brain injury and other causes. They provided limited demographic information (age only); therefore, we were unable to judge whether these individuals were representative of aphasic and/or "normal" adults in the United States. Kertesz did not provide formal standardization of the revised WAB; rather, he presented only a comparison between the revised and original instruments.50

Reliability

Shewan and Kertesz51 provided evaluations of internal consistency, test-retest, inter- and intra-rater reliability for the original WAB.52 Shewan presented additional data on the internal consistency reliability of the WAB language quotient subscore (Evidence Table 10).53

Internal consistency reliability

No internal consistency data were reported for the revised WAB. Shewan and Kertesz reported internal consistency reliability for the original WAB tests administered to 140 aphasic subjects from the standardization sample.51 Cronbach's coefficient alpha was 0.91 and Bentler's coefficient theta (for tests that combine subscores to yield a composite score) was 0.97, suggesting high internal consistency.
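As a reminder of what Cronbach's coefficient alpha measures, the following is a minimal sketch of the standard formula applied to a respondents-by-items score matrix. The function name and the data are hypothetical; nothing here reproduces the WAB analyses, and Bentler's coefficient theta is not shown.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for `scores`, a list of rows (one per respondent),
    each row holding that respondent's scores on the same k items."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: three respondents, two perfectly parallel items.
parallel_items = [[1, 1], [2, 2], [3, 3]]
print(cronbach_alpha(parallel_items))  # 1.0 for perfectly parallel items
```

Alpha rises as items covary more strongly relative to their individual variances, so values near 0.9, such as those reported for the original WAB, indicate that the subtests largely measure a common construct.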

Shewan presented internal consistency data for the Language Quotient (LQ) using data from 94 aphasic patients from a larger study on language therapy and recovery from aphasia.53 These individuals were similar to the standardization sample used earlier by Shewan and Kertesz.51 Specifically, 55 were male and 39 were female; all were functional English speakers prior to aphasia, were on average 65 years of age (range 29 to 85), and represented a variety of different aphasic categories. Cronbach's coefficient alpha for the LQ was 0.91 and Bentler's coefficient theta was 0.97, suggesting high internal consistency. The consistency of these results with those presented earlier by Shewan and Kertesz suggests that the 1986 study may be a repetition of the earlier results; it is not possible to determine this definitively.

Test-retest or intra-rater reliability

Three studies reported data on test-retest reliability,51,53,54 although the 1986 Shewan53 study appears to re-report the 1980 Shewan and Kertesz data.51

Shewan and Kertesz examined test-retest reliability using 38 chronic aphasic subjects, who were stable at the time of initial testing;51 Shewan again reported the analysis of these subjects with respect to the subtests that make up the LQ.53 Of the 38 individuals who had completed the spoken language section of the WAB, 32 completed the Reading subtest, and 25 completed the Writing subtest on each of two test sessions separated by at least a 6-month period (range = 6 months to 6 years and 6 months; median = 12 to 23 months). Different examiners assessed 22 of the 38 individuals on retest. Pearson's correlation coefficient for the individual subtests ranged from 0.88 to 0.97 (Praxis had a correlation of 0.58 but was not reported by Shewan). Correlations for the Aphasia and Cortical Quotients were 0.97 and 0.90, respectively.51 The mean absolute score difference between the test and retest was less than 10 points for each subtest, indicating less than 10 percent variation.
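The two statistics used in this kind of test-retest analysis, Pearson's correlation coefficient and the mean absolute difference between sessions, can be sketched as follows. The scores are hypothetical and the helper names are ours; this is not a reconstruction of the Shewan and Kertesz data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between paired scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_abs_diff(x, y):
    """Mean absolute difference between test and retest scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical test and retest scores for five stable patients.
test_scores = [62, 75, 48, 90, 81]
retest_scores = [65, 73, 50, 88, 84]
print(f"r = {pearson_r(test_scores, retest_scores):.3f}")
print(f"mean |difference| = {mean_abs_diff(test_scores, retest_scores):.1f}")
```

Reporting both values is useful because a high correlation shows only that patients keep their rank order; the mean absolute difference shows whether the scores themselves shifted between sessions.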

Kertesz and McCabe, in looking at recovery patterns and prognosis for 93 persons with aphasia, administered the WAB at 45 days post-onset; at 3, 6, and 12 months; and yearly thereafter.54 They looked at the Language Quotient (LQ) of the WAB separately in a subgroup of 22 patients who had chronic, long-term aphasia and for whom little recovery was observed. This group had a Pearson's correlation coefficient of 0.992, significant at p < 0.01, suggesting high test-retest reliability in a population with stable aphasia.

Shewan and Kertesz reported intra-rater reliability results for the eight subtests and two composites (Aphasia and Cortical Quotients) of the original WAB;51 Shewan reported the same results. In this investigation, three judges scored 10 videotaped administrations of the original WAB twice, with a between-test interval of several months. No data were provided on the aphasic subjects other than to indicate that they varied in severity and type of aphasia. Intra-rater correlation coefficients ranged between 0.79 and 1.00, and all but one were significant at p < 0.001. Of six correlation coefficients that were less than 0.98, five were from Information Content and Fluency subtests. Shewan and Kertesz suggested that these smaller correlations could likely be attributed to the nature of the two subtests (i.e., single-item rather than multiple-item scales and differences in weighting schemes for transforming raw scores to standard scores).51

Inter-rater reliability

Shewan and Kertesz evaluated inter-rater reliability by having eight judges score videotaped original WAB test administrations to 10 aphasic individuals;51 again Shewan re-reported the data for the six LQ subtests. As in their reports on intra-rater reliability, the authors gave no descriptive data about the aphasic subjects. For the Writing and Construction Subtests, the subjects' original performance was available for scoring. Correlations by subtest and judge were greater than 0.90 for all subtests except Fluency, which ranged from 0.77 to 0.84 for the eight judges.

Validity

The WAB-2 manual provided no information on the validity of the instrument. Instead, construct validity data for the original WAB appeared in several peer-reviewed articles (Evidence Table 11).44,53,55 Kertesz provided evidence of concurrent validity of the revised WAB-2.50

Construct validity

Four peer-reviewed articles addressed various aspects of the construct validity of the original WAB.44,51,53,55

Shewan and Kertesz applied principal components analysis to previously collected data on 142 individuals with aphasia.51 Four components accounted for 100 percent of the total variance, with the first factor accounting for 82 percent. The five WAB subtests contributed equally to the first factor. Additionally, they reported evidence that WAB scores differed significantly between aphasic and nonaphasic subjects (F = 369.4, p < 0.001).

Crary and Gonzalez-Rothi evaluated the relationships between the 10 WAB subtests and the Aphasia Quotient (AQ).55 Correlation coefficients (not corrected for multiple comparisons) between the subtests and the AQ ranged from 0.56 to 0.93, all significant at p < 0.05. The inter-subtest correlation coefficients ranged from 0.32 to 0.89. One subtest, Information Content, accounted for 87 percent of variability of the AQ, suggesting that nine of the subtests contributed minimal information in comparison with the Information Content subtest. Just as the original WAB data all came from Western Ontario, these data all came from Florida, possibly limiting the generalizability of this information.

Shewan, using data from the original validation of the WAB, examined subsets of subjects to compare subsets of the WAB.53 She found that time is a significant predictor of the LQ (F = 43.33, p < 0.00001). LQ scores increased over testing sessions by 27.17 and 11.72 for treated and untreated groups, respectively. The LQ scores of mild, moderate, and severe aphasic patients increased over time, based on initial AQ. Time and severity had significant main effects (time, F = 106.64, p < 0.00001 and severity, F = 77.25, p < 0.00001). However, LQ correlated 0.98 with the overall AQ, suggesting that the LQ provided no new information. The sample sizes were small and Shewan was reporting data from earlier studies; these factors weaken the evidence for the validity of the WAB.

Finally, Crary et al. used a cluster analysis to estimate how well the original WAB classified aphasia patients.44 They analyzed scores on the original WAB for 47 patients who were aphasic after having experienced a single, thromboembolic cerebrovascular accident (CVA). Patients were on average 57.7 years of age (range = 26-84), had completed 12 years of education (range = 8-16), and were 16.1 months (range = 1-80) post CVA. Crary et al. used a statistical clustering technique to analyze the original WAB scores and then compared the results of the clustering to the original WAB classifications.44 The original WAB scores resulted in three clusters that accounted for 97 percent of the total variance. Crary et al. computed the average value of each classification variable for patients within each cluster and compared them to the original WAB classification taxonomy.44 They then compared the resultant classifications to those derived from the scores on the original WAB. Only 30 percent of the cases matched the original classification. Although these results suggest that the WAB may not classify patients consistently, we urge that the results be interpreted with caution given the relatively small sample size and the lack of descriptive information about the patients in the Crary et al. study.

Concurrent validity

Kertesz compared original and revised WAB scores for 20 consecutive patients.50 Pearson correlations for all subtests ranged from 0.85 to 0.99; all correlations were significant at p < 0.001 except the Spoken Word-Written Choice subtests (p < 0.01). The AQ correlation was 0.99.

Shewan and Kertesz reported evidence of both concurrent and divergent validity.51 As evidence of concurrent validity, they reported correlations between WAB subtests and Neurosensory Center Comprehensive Evaluation of Aphasia (NCCEA) subtest scores ranging from 0.82 to 0.92; the correlation between total scores for both instruments was 0.97, suggesting strong evidence of concurrent validity. They also reported a correlation of 0.57 between the WAB total score and Raven's Coloured Progressive Matrices (RCPM), a test unrelated to aphasia.

Content validity

Shewan and Kertesz discussed the point that the WAB subtests assess all language modalities and are comparable to previous aphasia batteries, including the original BDAE.51

Available Normative Data

Normative data, per se, were not provided for either version of the WAB. However, subtest scores allow the examiner to classify aphasic patients using WAB scores derived from the original standardization sample.51 The members of the standardization sample may not be representative of individuals with aphasia because they were all from Western Ontario, Canada, and included both traditional aphasic (post-stroke) patients and traumatic brain injury patients. These groups performed very differently on language tests. In a follow-up article, Shewan presented data from new subjects, but all were from the state of Florida.53 They did include traumatic-brain-injured patients, who may be more cognitively impaired than a traditional aphasic subject pool.

Key Question No. 2

We identified no evidence of predictive validity for the second edition of the WAB. However, Lincoln et al. examined the ability of the original WAB to predict activities of daily living (ADLs) (Evidence Table 12).56 They evaluated 54 patients on a stroke unit, administering the original WAB at admission, at discharge, and at 9 months after stroke. Using stepwise multiple regression, they reported that the WAB Reading and Writing Quotient accounted for 7 percent of the total variability in Extended ADL Mobility, 16 percent of Extended ADL Kitchen, and 34 percent of Extended ADL Leisure measured at 9 months post admission. These results, from Nottingham in the United Kingdom, may not generalize to other countries in general or to the United States in particular. Factors such as availability of home health care and rehabilitation may affect recovery and thus be very important in contributing to ADL measures.

Clinical Evaluation of Language Fundamentals, 3rd Edition (English)

Table 8. Clinical Evaluation of Language Fundamentals, 3rd Edition (CELF)
Author: E.H. Wiig, W.A. Secord, and E. Semel
Publisher: Psychological Corporation, P.O. Box 839954, San Antonio, TX 78283-3954; 1-800-228-0752; http://www.psychcorp.com/
Date of Publication: 1995
Ages: 6-0 through 21-11 years
Administration Time: 30-45 minutes for the six subtests constituting the Receptive Language and Expressive Language composite scores, and Total Language Scores; approximately 60 minutes for all 11 subtests
Scores: Standard scores and percentile ranks with computed 68 percent and 90 percent confidence intervals, stanines, normal curve equivalents, and age equivalents
Normative Data Provided: Norms for the individual subtests, and the Receptive and Expressive Language Composite Scores, and the Total Language Scores are reported for each 1-year age group from 6-0 through 16-11 years and from 17-0 through 21-11 years.
Components: Examiner's manual, two stimulus manuals with easels containing visual stimuli, record form, and a software scoring program.
Procedures: The examiner shows the examinee the stimulus items and follows the specific instructions for the items and subtests. Correct responses are coded as 1, incorrect or no response as 0 for some subtests and by different approaches for others. Calculation of subtest totals is described in detail in the manuals and varies by subtest. Subtest raw scores are converted to norm-referenced standard scores using tables provided in the manual. The Receptive Language Composite Score is the sum of the standard scores of the age-appropriate subtests; the score then is converted to a standard score. A similar approach is used to derive a standard score for the Expressive Language Composite Score. The Total Language Score is calculated as the sum of the Receptive and Expressive Language standard composite scores and then converted to a standard score using the table in the manual.
Other: None
Earlier Versions: Clinical Evaluation of Language Fundamentals-Revised (1987)62
The Clinical Evaluation of Language Fundamentals, 3rd Edition (CELF-3) was designed to identify, diagnose, and perform follow-up evaluations of language deficits in children, adolescents, and young adults (Table 8 and Evidence Tables 13-17).57 It assesses receptive and expressive language abilities in the areas of verbal concepts and directions, word associations, semantic relationships, word and sentence structure, and recall of spoken language.

Key Question No. 1

Semel et al. standardized the CELF-3 using 2,450 English-speaking children, adolescents, and young adults in 47 states.57 None of the sample was receiving language therapy or had a diagnosed or identified language disorder. The sample was representative of US children with respect to race, sex, residence (urban or rural), family income, educational attainment of parents, and geographic region.

Reliability

Data on the CELF-3's reliability were available only in the instrument manual (Evidence Table 14).57

Internal consistency reliability

Semel et al. reported Cronbach's coefficient alphas by age ranging from 0.83 to 0.95 for Receptive Language and Expressive Language subscales and the Total Language score; all alphas for the Total Language score were above 0.90.57 Thus, the CELF-3 met our "relaxed" criterion.

Test-retest or intra-rater reliability

For the Receptive Language and Expressive Language composite scores, mean test-retest correlations across age were 0.80 and 0.86, respectively. For the Total Language Score, the mean correlation was 0.91. Thus, the total score met our criterion; the composite scores did not.

Inter-rater reliability

Semel et al. reported inter-rater reliability for the Formulated Sentences and Word Associations subtests, which require clinical judgment in scoring, for three different age groups: 6-year-olds, 11-year-olds, and 16-year-olds.57 For Formulated Sentences, correlations ranged from 0.70 to 0.91 (lowest correlation in oldest group). For Word Associations, correlations ranged from 0.97 to 0.99. The CELF-3 met our criterion except for the Formulated Sentences subtest in the oldest group.

Validity

Reports of validity data were available for both the CELF-357 and the earlier CELF-R (Evidence Table 15).58-60

Construct validity

Construct validity data were available for both the CELF-357 and the CELF-R.58 Semel et al. examined the relationships between the subtests for ages 6 through 8 years and for ages 9 years and older; correlations ranged from 0.25 to 0.63 with only one falling below our 0.30 threshold. They also conducted a discriminant analysis to determine the extent to which the CELF-3 would discriminate between children and adolescents with and without language disorders. They reported the overall agreement (71.3 percent) between the CELF-3 (mean minus 1 standard deviation as cut-off) and school system classification; we calculated the sensitivity (0.80), specificity (0.67), and positive and negative predictive values (0.57 and 0.85, respectively). Factor analysis using the standardization sample suggested that the CELF-3 captures a single factor measuring language.
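The classification statistics named above (sensitivity, specificity, positive and negative predictive values, and overall agreement) all derive from a 2-by-2 table that crosses the test result with the reference classification. The following minimal sketch shows the arithmetic; the counts are hypothetical and are not the CELF-3 data, which the report does not reproduce in this section.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Classification statistics from a 2x2 table.

    tp: test positive, reference positive   fp: test positive, reference negative
    fn: test negative, reference positive   tn: test negative, reference negative
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # share of true cases the test flags
        "specificity": tn / (tn + fp),   # share of non-cases the test clears
        "ppv": tp / (tp + fp),           # share of positive tests that are true cases
        "npv": tn / (tn + fn),           # share of negative tests that are non-cases
        "agreement": (tp + tn) / total,  # overall share classified the same way
    }


# Hypothetical counts, for illustration only.
for name, value in diagnostic_metrics(tp=45, fp=20, fn=5, tn=30).items():
    print(f"{name}: {value:.2f}")
```

Unlike sensitivity and specificity, the predictive values depend on how common language disorders are in the sample, so they would differ in a population with a different prevalence.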

Perez et al. reported correlations between CELF-R Total score and Receptive and Expressive Language composite scores of 0.90 and 0.93, respectively.58

Concurrent validity

Semel et al. provided evidence of concurrent validity for the CELF-3.57 Perez et al., Kotsopoulos et al., and Lewis et al. provided data for the CELF-R.58-60

Semel et al. examined the concurrent validity of the CELF-3 in relation to the CELF-Preschool (CELF-P),61 the CELF-Revised (CELF-R),62 and the Wechsler Intelligence Scale for Children, Third Edition (WISC-III).63 They reported correlations between CELF-3 and CELF-R Expressive, Receptive, and Total Language scores ranging from 0.72 to 0.79 for a subset of the standardization sample and 0.68 to 0.83 for language-disordered children receiving therapy. In a comparison of the CELF-3 and CELF-P, they reported correlations between related subtests ranging from 0.29 to 0.58, and between composite scores ranging from 0.49 to 0.63. Correlations among CELF-3 and WISC-III composite scores ranged from 0.56 to 0.75. Taken together, these results suggest acceptable concurrent validity.

Perez et al. examined the relationships between the CELF-R and the Test of Nonverbal Intelligence (TONI),64 and the CELF-R and the Slosson Intelligence Test (SIT),65 for children with learning difficulties who had been referred for central processing evaluation.58 Correlations between the CELF-R and TONI composite scores ranged from 0.45 to 0.58; correlations between CELF-R and SIT composite scores ranged from 0.59 to 0.69.

Kotsopoulos et al. also reported correlations between the CELF-R subtests and scores for reading comprehension, decoding, and spelling using the Kaufman Test of Educational Achievement (KTEA)66 for children with severe behavioral and other psychiatric disorders.59 CELF-R subtests were moderately to highly correlated with the KTEA subtests as follows: Reading Decoding (0.54 to 0.67), Reading Comprehension (0.62 to 0.73), and Spelling (0.44 to 0.61).

Finally, Lewis et al. reported correlations between CELF-R scores and the Test of Written Language, Second Edition (TOWL-2) Spontaneous Writing subtest scores67 for children diagnosed with phonological disorders as preschoolers.60 They reported significant correlations for the CELF-R Sentence Assembly subtest and Total Language scores with the TOWL-2 Syntactic subtest, but only for children who had phonological disorders.

Content validity

Semel et al. documented content validity by developing a conceptual model and describing the relationship between subtest content and the model.57 Additionally, they used a panel of speech-language pathology experts to review all CELF-3 materials for sex, racial/ethnic, and regional biases and employed statistical methods to identify biased items.

Available Normative Data

Semel et al. presented normative data for the CELF-3 (Evidence Table 16).57 The total standardization sample closely matched the US population with respect to geographic region, sex, mother's education, and race or ethnic group.

Applicability of CELF-3 to Target Populations

Semel et al. provided no guidance on the use of CELF-3 with populations of interest to the SSA.57 They excluded from the standardization sample children diagnosed with a language disorder or a hearing deficit.

Key Question No. 2

We identified no evidence on the CELF-3 concerning predictive validity. Kotsopoulos et al. reported data on the predictive validity of the original CELF-R instrument, however (Evidence Table 17).59 CELF-R scores predicted gains in reading and math but not spelling among children receiving treatment for severe behavioral and psychiatric disorders.

Clinical Evaluation of Language Fundamentals, 3rd Edition (Spanish)

Table 9. Clinical Evaluation of Language Fundamentals, 3rd Spanish Edition (CELF-3Sp)
Authors: E. Semel, E.H. Wiig, and W.A. Secord
Publisher: The Psychological Corporation, 555 Academic Court, San Antonio, TX 78204; Voice 1-800-872-1726; Fax 1-800-232-1223; http://www.psychcorp.com/
Date of Publication: 1997
Ages: 6-0 through 21-11 years
Administration Time: 20-45 minutes for core subtests; approximately 1 hour for all subtests (core, supplemental, and optional)
Scores: Age equivalents, percentile ranks, and standard scores for subtests: Estructura de oraciones (Sentence Structure), Conceptos y direcciones (Concepts and Directions), Clases de palabras (Word Classes), Estructura de palabras (Word Structure), Formulación de oraciones (Formulated Sentences), Recordando oraciones (Recalling Sentences), Asociación de palabras (Word Association), and Escuchando párrafos (Listening to Paragraphs); and for composites: Spanish Receptive Language, Spanish Expressive Language, and Spanish Total Language. Means and standard deviations of raw scores are provided for errors and time for the optional subtest, Enumeración rápida y automática (Rapid Automatic Naming).
Normative Data Provided: Data are provided at 1-year intervals from ages 6-0 to 16-11 years, and for the 5-year interval from age 17-0 to 21-11 years, based on a standardization sample of 1,050 individuals from 19 states who spoke varying dialects of Spanish.
Other: None
Earlier Versions: None
The Spanish version of Clinical Evaluation of Language Fundamentals, 3rd Edition (CELF-3Sp) was designed to identify, diagnose, and perform follow-up evaluations of language deficits in children, adolescents, and young adults.68 Like the CELF-3, it assesses receptive and expressive language abilities in the areas of verbal concepts and directions, word associations, semantic relationships, word and sentence structure, and recall of spoken language (Table 9). We identified no evidence of predictive validity (Key Question No. 2) for this instrument.

Key Question No. 1

Semel et al. standardized the CELF-3Sp using 1,050 Spanish-speaking children, adolescents, and young adults from 19 states (Evidence Tables 18-21).68 Individuals were included only if they were fluent in Spanish. Like the CELF-3 standardization sample, none had hearing deficits or diagnosed language disorders. The investigators documented that the sample was representative of the US Hispanic population with respect to educational attainment of parents and geographic region.

Reliability

Reports of the CELF-3Sp's reliability were available only in the instrument manual (Evidence Table 19).68

Internal consistency reliability

Semel et al. reported Cronbach's coefficient alphas by age for Receptive Language (α = 0.82-0.90), Expressive Language (α = 0.89-0.95) composite scores, and the Total Language score (α = 0.91-0.95); all alphas for the Total Language score were greater than 0.90.68 Thus, the CELF-3Sp met our "relaxed" criterion.

Test-retest or intra-rater reliability

For the Receptive Language and Expressive Language composite scores, mean test-retest coefficients across age were 0.79 and 0.85, respectively. For the Total Language Score, the mean test-retest coefficient across age was 0.87. Thus, the CELF-3Sp did not meet our criterion.

Inter-rater reliability

Semel et al. reported inter-rater reliability only for the Formulated Sentences and Word Associations subtests in selected age groups.68 For Formulated Sentences, correlations ranged from 0.79 to 0.82 (lowest correlation in oldest group). For Word Associations, correlations ranged from 0.97 to 0.98.

Validity

Semel et al. presented CELF-3Sp data on construct, concurrent, and content validity (Evidence Table 20).68

Construct validity

Semel et al. examined the relationships between the subtests for children ages 6 through 8 years and for those ages 9 and older.62 Correlations ranged from 0.13 to 0.65 for the younger age group, and 0.21 to 0.50 for the older age group, with most above our 0.30 threshold. They also conducted a discriminant analysis to determine the extent to which the CELF-3Sp would discriminate between children and adolescents with and without language disorders. Semel et al. reported overall agreement was 71.6 percent; we calculated the sensitivity (0.75), specificity (0.69), and positive and negative predictive values (0.65 and 0.78, respectively). Factor analysis suggested that, like the CELF-3, the CELF-3Sp captures a single factor measuring language skills.

Concurrent validity

Semel et al. provided limited concurrent validity evidence, comparing the CELF-3Sp to the Spanish version of the CELF-3 Observational Rating Scales (ORS).68 Parents, teachers, and children with language disorders completed the ORS. Only correlations between CELF-3Sp Total score and teacher or child ratings (−0.45 and −0.57, respectively) were significant for children with speech-language disorders. Correlations were smaller in magnitude for children without speech-language disorders. Correlations in the negative direction were expected because higher ratings on the ORS are associated with more language problems.

Content validity

Semel et al. provided extensive evidence of the CELF-3Sp's content validity by describing the conceptual model employed and using a panel of experts in bilingual assessment or Spanish language issues to review test content for biases.68

Available Normative Data

Semel et al. documented that the total standardization sample closely matched the US population with respect to the proportion of the nation's Hispanic population located in various geographic regions, as well as for sex and educational level of the primary caregiver (Evidence Table 21).68 However, we found no information on how representative these data are of the US Hispanic population as a whole.

Applicability of CELF-3Sp to Target Populations

Semel et al. provided no guidance on the use of the CELF-3Sp with any populations of interest to the SSA.68 None of the individuals in the standardization sample was diagnosed with a language disorder or a hearing deficit.

Clinical Evaluation of Language Fundamentals-Preschool

Table 10. Clinical Evaluation of Language Fundamentals-Preschool (CELF-P)
Author: E. Semel, E.H. Wiig, and W.A. Secord
Publisher: The Psychological Corporation, P.O. Box 839954, San Antonio, TX 78283-3954; 1-800-228-0752; http://www.psychcorp.com/
Date of Publication: 1992
Ages: 3-0 through 6-11 years
Administration Time: 30-45 minutes depending upon the age of the child and his or her level of cooperation with the examiner
Scores: Standard scores and percentile ranks with computed 68% and 90% confidence intervals, and age equivalents
Normative Data Provided: Norms for the individual subtests, and the Receptive, Expressive, and Total Language Scores reported for each 6-month age group from 3-0 through 6-11 years.
Components: Examiner's manual, three stimulus manuals with easels containing visual stimuli, and record form.
Procedures: The examiner shows the examinee the stimulus items and follows the specific instructions for the items and subtests. Correct responses are coded as 1, incorrect or no response as 0 for some subtests and by different approaches for others. Calculation of subtest totals is described in detail in the manuals and varies by subtest. Subtest raw scores are converted to norm-referenced standard scores using tables provided in the manual. The Receptive Language Composite Score is the sum of the standard scores for the Linguistic Concepts, Sentence Structure, and Basic Concepts subtests; the score then is converted to a standard score. A similar approach is used to derive a standard score for the Expressive Language Composite Score (Recalling Sentences in Context, Formulating Labels, and Word Structure subtests). The Total Language Score is calculated as the sum of the Receptive and Expressive Language standard composite scores and then converted to a standard score using the table in the manual.
Other: Downward (in terms of age) revision of the Clinical Evaluation of Language Fundamentals-Revised (1987)62
Earlier Versions: None
The Clinical Evaluation of Language Fundamentals-Preschool (CELF-P) was designed to identify, diagnose, and perform follow-up evaluations of language deficits in preschool children (Table 10).61 Like the CELF-3, it assesses receptive and expressive language abilities in the areas of word meanings, word and sentence structure, and recall of spoken language. We identified no data on predictive validity (Key Question No. 2) for this instrument.

Key Question No. 1

Wiig et al. standardized the CELF-P using 800 English-speaking children ages 3-0 through 6-11 years (i.e., children ages 3 years, 0 months to 6 years, 11 months), none of whom was receiving language therapy or had a diagnosed or identified language disorder (Evidence Tables 22 through 25).61 They documented that the sample was representative of US children with respect to race, sex, mother's educational attainment, and geographic region.

Reliability

Reports of the CELF-P's reliability were available only in the instrument manual (Evidence Table 23).61

Internal consistency reliability

Wiig et al. reported Cronbach's coefficient alphas by age for the Receptive (alpha = 0.73 to 0.92) and Expressive Language (alpha = 0.82 to 0.94) composite scores, and for the Total Language score (alpha = 0.86 to 0.96).61 With the exception of the Receptive Language composite score for children ages 6-6 to 6-11, alphas met our relaxed criterion for acceptable internal consistency reliability.
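To make the internal consistency statistic concrete, the following Python sketch computes Cronbach's coefficient alpha from item-level scores using the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). The function name and the small data set are hypothetical illustrations; they are not drawn from the CELF-P manual, which reports only the resulting coefficients.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Return Cronbach's coefficient alpha.

    item_scores: list of rows, one row per examinee, each row a list of
    item scores (for example 1 = correct, 0 = incorrect).
    """
    k = len(item_scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Variance of each item across examinees (columns of the matrix).
    item_variances = [variance(column) for column in zip(*item_scores)]
    # Variance of each examinee's total score.
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of four examinees to three items.
example = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
print(round(cronbach_alpha(example), 2))
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why a short or heterogeneous composite in a narrow age band can fall below a reliability threshold even when the instrument performs well elsewhere.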

Test-retest or intra-rater reliability

Wiig et al. reported test-retest reliability for two age groups: children ages 3-6 years through 3-11 years and 4-6 years through 4-11 years.61 For Receptive Language, test-retest reliability coefficients were 0.93 and 0.87, respectively, for the younger and older age groups. For Expressive Language, the stability coefficients were 0.94 and 0.92; for Total Language, they were 0.97 and 0.93.
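Test-retest (stability) coefficients of this kind are commonly computed as the Pearson correlation between scores from the first and second administrations. The sketch below assumes that convention, which the source does not state explicitly; the function name and the scores are hypothetical.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between paired scores from two administrations."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sum of cross-products and sums of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    sxx = sum((x - mean_x) ** 2 for x in first)
    syy = sum((y - mean_y) ** 2 for y in second)
    return sxy / sqrt(sxx * syy)


# Hypothetical standard scores for five children tested twice.
time_1 = [85, 92, 100, 108, 115]
time_2 = [88, 90, 103, 105, 118]
print(round(pearson_r(time_1, time_2), 2))
```

A coefficient near 1.0 indicates that children keep nearly the same rank order across administrations; it does not by itself show that the mean score is unchanged.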

Inter-rater reliability

Wiig et al. reported inter-rater reliability as mean percentage agreement (greater than 90 percent).61 Evidence of inter-rater reliability did not meet our threshold for acceptable evidence.

Validity

Data validating the CELF-P were available only in the instrument manual (Evidence Table 24).61

Construct validity

Wiig et al. reported correlations among the CELF-P composite scores, ranging from 0.60 to 0.84.61 The lowest correlation was for the oldest age group; the authors attributed this to ceiling effects. They also conducted a discriminant analysis to determine how well scores on the CELF-P identified children with or without language disorders (based on school system classification); EPC staff calculated the sensitivity, specificity, and PPV and NPV values. Overall agreement ranged from 72 percent to 74 percent depending upon the threshold used with the CELF-P. Sensitivity ranged from 0.84 to 0.93, specificity from 0.65 to 0.69, PPV from 0.48 to 0.60, and NPV from 0.89 to 0.96.
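The classification statistics reported above are all derived from a 2 x 2 table that crosses the test result with the reference classification (here, the school system classification). The Python sketch below shows the calculations; the function name and the cell counts are hypothetical and are not the CELF-P data.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute classification statistics from 2 x 2 cell counts.

    tp: test positive, reference positive (true positives)
    fp: test positive, reference negative (false positives)
    fn: test negative, reference positive (false negatives)
    tn: test negative, reference negative (true negatives)
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # disordered children detected
        "specificity": tn / (tn + fp),   # non-disordered children cleared
        "ppv": tp / (tp + fp),           # positive results that are correct
        "npv": tn / (tn + fn),           # negative results that are correct
        "agreement": (tp + tn) / total,  # overall agreement
    }


# Hypothetical counts for illustration only.
result = diagnostic_metrics(tp=9, fp=4, fn=1, tn=6)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Sensitivity and specificity describe the instrument relative to the reference classification, whereas PPV and NPV also depend on how common the disorder is in the sample tested.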

Concurrent validity

Wiig et al. examined the concurrent validity of the CELF-P in relation to five other instruments.61 All data suggested acceptable evidence of concurrent validity. They examined the relationship between the CELF-P and the CELF-R. Correlations for the parallel subtests ranged from 0.27 to 0.84; correlations for Receptive Language and Expressive Language composite scores ranged from 0.63 to 0.86; correlations for the Total Language scores ranged from 0.71 to 0.93. They also reported the overall agreement between the CELF-P and the pass/fail decision from the CELF-R Screening Test; EPC staff calculated the sensitivity, specificity, and PPV and NPV values. Overall agreement was high (75 percent); the sensitivity was 1.00, specificity was 0.74, the PPV and NPV were 0.11 and 1.00, respectively.
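A PPV as low as 0.11 alongside a sensitivity of 1.00 is arithmetically possible when few children in the sample fail the reference test. The Python sketch below applies Bayes' rule to show this dependence. The prevalence values are assumptions chosen for illustration; the source does not report the proportion of children who failed the CELF-R Screening Test.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value implied by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv_from_prevalence(sensitivity, specificity, prevalence):
    """Negative predictive value implied by Bayes' rule."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Sensitivity and specificity as reported above; prevalence is assumed.
for prevalence in (0.03, 0.10, 0.50):
    ppv = ppv_from_prevalence(1.00, 0.74, prevalence)
    npv = npv_from_prevalence(1.00, 0.74, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

Under these assumptions, a prevalence of roughly 3 percent would yield a PPV near the reported 0.11, whereas the same sensitivity and specificity would yield a much higher PPV in a sample where half the children failed the reference test.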

Correlations between the composite scores for the CELF-P and the PLS-3 ranged from 0.73 to 0.90. Correlations between the composite scores of the CELF-P and those of the WPPSI-R (i.e., Performance Score, Verbal Score, and Full Scale Score) ranged from 0.45 to 0.72. Correlations between the composite scores of the CELF-P and those of the DAS (i.e., Nonverbal Cluster, Verbal Cluster, and General Conceptual Ability) ranged from 0.53 to 0.70.

Content validity

Wiig et al. described the content of each subtest in terms of the specific syntactic and semantic skills measured.61 They did not provide an explicit model for the CELF-P, but instead stated that the model for the CELF-R served as a template for the CELF-P subtests and items.

Available Normative Data

Wiig et al. provided normative data for the CELF-P (Evidence Table 25).61 The standardization sample closely matched the US population with respect to geographic region, sex, mother's educational attainment, and race or ethnic group.

Applicability of CELF-P to Target Populations

Wiig et al. provided no guidance on the instrument's use with any of the populations of interest to the SSA.61 No individual in the standardization sample was diagnosed with a language disorder or a hearing deficit; all were described as "normal."

Test of Language Development-Primary, 3rd Edition

Table 11. Test of Language Development-Primary, 3rd edition (TOLD-P:3)
Authors: P.L. Newcomer and D.D. Hammill
Publisher: Pro-Ed, 8700 Shoal Creek Boulevard, Austin, TX 78757-6897; 512-451-3246; http://www.proedinc.com/
Date of Publication: 1997
Ages: 4-0 through 8-11 years
Administration Time: 30 minutes to 1 hour for core subtests; additional 30 minutes for supplemental tests
Scores: Percentile ranks, age equivalents, and standard scores for core subtests: Picture Vocabulary, Relational Vocabulary, Oral Vocabulary, Grammatic Understanding, Sentence Imitation, and Grammatic Completion; for supplemental subtests: Word Discrimination, Phonemic Analysis, and Word Articulation; and for composites: Listening, Organizing, Speaking, Semantics, Syntax, and Spoken Language.
Normative Data Provided: Data provided at 6-month age intervals between 4-0 and 8-11 years based on standardization sample of 1,000 children from 28 states.
Components: Examiner's manual, picture book, profile/examiner record forms
Procedures: The examiner is seated beside the child at a table. Usually all the core subtests are administered in one session, but the test can be administered over several sessions for young or distractible children. Some items require verbal responses from the child and others require the child to point to appropriate pictures in the picture manual. All subtests are untimed, but the examiner encourages the child to work steadily. Testing begins with the first item in each subtest regardless of the child's age. For each of the core subtests, testing is terminated when the child misses five consecutive items. The examiner scores individual items according to instructions in the manual as each item is administered. The supplemental subtests, if administered, are given during a separate session from the core subtests to avoid child and examiner fatigue. The supplemental subtests are administered only to children less than 7 years of age, unless the child is having specific problems in phonological development.
Other: None
Earlier Versions: Test of Language Development (1977),112 Test of Language Development-Primary (1982),113 Test of Language Development-Primary, 2nd edition (1988)114
The Test of Language Development-Primary, Third Edition (TOLD-P:3), provides professionals with a well-constructed standardized test for assessing children's spoken language (Table 11).69
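The discontinue rule described under Procedures in Table 11 (begin with the first item and stop after five consecutive misses) can be sketched in Python as follows. This is an illustrative sketch only: the function name is hypothetical, and it assumes that the raw score is simply the number of items answered correctly before the ceiling is reached; the manual's actual scoring instructions govern in practice.

```python
def administer_core_subtest(responses, ceiling_run=5):
    """Apply a consecutive-miss ceiling rule to a sequence of item responses.

    responses: iterable of booleans (True = correct), in item order,
               starting with the first item regardless of the child's age.
    ceiling_run: number of consecutive misses that ends testing.
    Returns (raw_score, items_administered).
    """
    raw_score = 0
    consecutive_misses = 0
    items_administered = 0
    for correct in responses:
        items_administered += 1
        if correct:
            raw_score += 1
            consecutive_misses = 0  # a correct answer resets the run
        else:
            consecutive_misses += 1
            if consecutive_misses == ceiling_run:
                break  # ceiling reached; remaining items are not given
    return raw_score, items_administered


# Hypothetical response pattern: testing stops after the ninth item.
pattern = [True, True, False, True, False, False, False, False, False, True, True]
print(administer_core_subtest(pattern))
```

Items that follow the ceiling are never administered, so any later correct responses in the hypothetical pattern do not contribute to the score.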

Key Question No. 1

Newcomer and Hammill standardized the TOLD-P:3 using 1,000 English-speaking children, ages 4-0 through 8-11 years, in 28 states (Evidence Tables 26-30).69 Children with learning disabilities, speech-language disorders and delays, mental retardation, and other handicaps were purposively included to be able to provide reliability and validity information for these groups. They documented that the sample was representative with respect to race, sex, residence (urban or rural), family income, educational attainment of parents, and geographic region.

Reliability

Reports of the TOLD-P:3's reliability were available in the instrument manual.69 Fodness provided test-retest reliability data for the previous version (TOLD-P:2) (Evidence Table 27).70

Internal consistency reliability

Newcomer and Hammill reported Cronbach's alpha coefficients across age groups.69 Alphas (across age) for all subtests ranged from 0.80 to 0.91; for composites they ranged from 0.91 to 0.96. Overall score alphas ranged from 0.95 to 0.96 across age groups. They also reported coefficients by sex, race or ethnicity, and disability subgroups (including children with misarticulation, delayed speech and language problems, learning disabilities, mental retardation, and attention-deficit/hyperactivity disorder [ADHD]). The alphas for these subgroups ranged from 0.80 to 0.97 for subtests and composites.

Test-retest or intra-rater reliability

Evidence of test-retest or intra-rater reliability was available for both the TOLD-P:3 and the TOLD-P:2.69,70 For the TOLD-P:3, correlations for subtests ranged from 0.77 to 0.90 and correlations for composites ranged from 0.82 to 0.92; only those for the Organizing and Spoken Language composites reached a level greater than 0.90.

Fodness reported correlations for Listening, Speaking, and overall performance composite scores all above 0.82, the level of reliability considered acceptable by the authors based on psychometric research literature.70 The correlations for the Semantic composite scores were below this level for all three age levels. Fodness suggested that the Listening composite score was insufficiently reliable at age 8, and that the Phonology composite score also was insufficiently reliable at ages 6 and 8.70 In the TOLD-P:3, Newcomer and Hammill revised all three of the subtests used to compute the Semantics composite score.69 The only composite score in the TOLD-P:3 unaffected by revisions compared to the TOLD-P:2 is the Syntax composite; the TOLD-P:3 does not include a Phonology composite score.

Inter-rater reliability

Newcomer and Hammill reported inter-rater correlations of 0.99 and higher, suggesting nearly perfect inter-rater reliability.69

Validity

Newcomer and Hammill provided evidence of content (including information on difficulty/discrimination, and test bias), construct, and concurrent validity (Evidence Table 28).69

Construct validity

Newcomer and Hammill extensively documented construct validity; all the evidence exceeds our threshold for acceptable evidence of construct validity.69 Additionally, they provided evidence of construct validity for specific disability groups, including speech and language problems, learning disabilities, mental retardation, and attention-deficit/hyperactivity disorder (ADHD).

First, they observed that mean raw scores were significantly correlated with age for each subtest (range of 0.50 to 0.62 for the six core subtests and 0.32 to 0.55 for the three supplemental subtests). Second, the mean standard scores on subtests for gender subgroups and race or ethnic subgroups were all within the standard error of measurement for each subtest, suggesting that test bias was minimal. Third, the rank order of the mean standard scores on TOLD-P:3 for disability subgroups was what the authors had predicted based on the degree of language disorder generally associated with each type of disability. A fourth type of evidence was that the core subtest raw scores were moderately intercorrelated, with values ranging from 0.37 to 0.59. Finally, the results of a factor analysis suggested that, as hypothesized, the six TOLD-P:3 subtests represented the domain measuring general spoken language ability.
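The second point above compares subgroup differences with the standard error of measurement (SEM). In classical test theory, SEM = SD * sqrt(1 - reliability), where SD is the standard deviation of the score scale. The Python sketch below applies that formula. The scale standard deviations (3 for subtests, 15 for composites) are assumptions typical of standard-score scales and are not stated in this report; the reliability values are taken from the upper ends of the alpha ranges reported earlier for TOLD-P:3 subtests and composites.

```python
from math import sqrt


def standard_error_of_measurement(scale_sd, reliability):
    """Classical test theory SEM for a score scale with a given reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return scale_sd * sqrt(1.0 - reliability)


# Assumed scale SDs (hypothetical): 3 for subtests, 15 for composites.
subtest_sem = standard_error_of_measurement(scale_sd=3, reliability=0.91)
composite_sem = standard_error_of_measurement(scale_sd=15, reliability=0.96)
print(f"subtest SEM is about {subtest_sem:.1f} points")
print(f"composite SEM is about {composite_sem:.1f} points")
```

Under these assumptions, a difference between subgroup means smaller than the SEM is no larger than the amount by which a single child's score would be expected to vary from measurement error alone.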

Concurrent validity

Newcomer and Hammill compared TOLD-P:3 scores and Bankson Language Test -- Second Edition (BLT-2) scores.69 Only the TOLD-P:3 Word Articulation subtest failed to correlate with language measures, but this subtest was not used in calculating any of the TOLD-P:3 composite scores. Correlations between TOLD-P:3 subtests and BLT-2 composite scores ranged from 0.59 to 0.97; correlations between TOLD-P:3 and BLT-2 composites ranged from 0.73 to 0.95; the correlation between total scores was 0.89.

Content validity

Newcomer and Hammill provided extensive evidence of the TOLD-P:3's content validity, developing a conceptual model, testing item difficulty and discrimination, and employing item response theory to examine item bias.69 In doing so, they provide the most extensive evidence among the instruments we reviewed.

Available Normative Data

Newcomer and Hammill presented normative data for 6-month age groupings from 4-0 through 8-11 years (Evidence Table 29).69 The standardization sample closely matched the US population with respect to geographic region, sex, mother's educational attainment, family income, urban versus rural residence, race, ethnicity, and disability status. The standardization sample included sufficiently large (107 to 258) samples of children in each subgroup for which standardization data were reported.

Applicability of TOLD-P:3 to Target Populations

Newcomer and Hammill provided evidence of the applicability of the TOLD-P:3 to several of the targeted populations -- in particular, children with cognitive deficits, learning disabilities, and existing speech-language problems or delays. Unlike the TOLD-I:3 manual, the TOLD-P:3 manual does not present separate reliability data for children who are hard of hearing.69

As indicated earlier, Newcomer and Hammill included targeted demographic groups differing in culture, race, and language abilities in the standardization sample. Children with speech-language delay, misarticulation, learning disabilities, mental retardation, and ADHD were included. Hammill and Newcomer reported that Cronbach's coefficient alphas were similar for all subgroups (based on sex, race and ethnicity, and disability) and of the same magnitude as for the entire normative sample.71 Further, analyses of item bias indicated that a relatively small proportion of test items were potentially biased with respect to the subgroups examined. In addition, the intercorrelations between the subtest and composite scores were in the same range for demographic subgroups as for the entire normative sample. These results support the authors' position that the test is reliable and valid for each of these subgroups.

Newcomer and Hammill specifically indicated that the TOLD-P:3 should not be given to children who are "deaf" or who do not speak English.69 They also stated that the test should be administered only to children whose chronological ages are between 4-0 and 8-11 years because reliability and validity have not been established for any groups other than those included in the normative sample; alternatively, they suggest different instruments. Thus, mentally retarded children and/or adults not within this age group should not be evaluated using the TOLD-P:3.

Key Question No. 2

We found no literature describing the predictive validity of the TOLD-P:3. Lewis et al. assessed the predictive validity of preschool TOLD-P:2 scores for school-age reading, language, and spelling skills in children identified with moderate to severe speech sound disorders as preschoolers.72 Preschool TOLD-P:2 scores predicted language scores, as measured by the CELF-R, and spelling scores.

They also evaluated the ability of the TOLD-P:2 Word Discrimination subtest and Semantic and Syntactic composite scores to predict school-age impairments in language, reading, and spelling. The Word Discrimination subtest predicted spelling disorders; the Semantic composite predicted reading; and the Syntax composite predicted all three types of disorders.

Test of Language Development-Intermediate, 3rd Edition

Table 12. Test of Language Development-Intermediate, 3rd Edition (TOLD-I:3)
Authors: D.D. Hammill and P.L. Newcomer
Publisher: Pro-Ed, 8700 Shoal Creek Boulevard, Austin, TX 78757-6897; 512-451-3246; http://www.proedinc.com/
Date of Publication: 1997
Ages: 8-0 to 12-11 years
Administration Time: Approximately 1 hour
Scores: Percentile ranks, age equivalents, and standard scores for subtests: Sentence Combining, Picture Vocabulary, Word Ordering, Generals, Grammatic Comprehension, and Malapropisms; and for composites: Listening, Speaking, Semantics, Syntax, and Spoken Language.
Normative Data Provided: Data provided at 6-month age intervals between 8-0 and 12-11 years based on standardization sample of 779 children from 23 states.
Components: Examiner's manual, picture book, profile/examiner record forms
Procedures: The examiner is seated beside the child at a table. Usually all the subtests are administered in one session, but the test can be administered over several sessions for immature or inattentive children. Most subtests require the child to listen and respond verbally to the examiner; the Picture Vocabulary subtest requires the child to listen and point to appropriate pictures in the picture manual. All subtests are untimed, but the examiner encourages the child to work steadily. Testing begins with the first item in each subtest regardless of the child's age. Testing is terminated when the child reaches the respective ceiling for each subtest according to the criteria provided in the manual. The examiner scores individual items according to instructions in the manual as each item is administered.
Other: Optional computerized scoring software available for Windows 3.1 and 95, Macintosh, and IBM DOS systems, providing menu-driven program for scoring test and generating report
Earlier Versions: The Test of Language Development -- Intermediate (1982),113 and The Test of Language Development -- Intermediate, 2nd edition (1988)114
The Test of Language Development-Intermediate, 3rd Edition (TOLD-I:3), measures the expressive and receptive language abilities of children and provides information about relative strengths and weaknesses (Table 12).71 We identified no evidence for Key Question No. 2 (predictive validity) for this instrument.

Key Question No. 1

Hammill and Newcomer standardized the TOLD-I:3 with 779 English-speaking children, ages 8-0 through 12-11 years, in 23 states (Evidence Tables 31-34).71 Children with learning disabilities, speech-language disorders, mental retardation, hearing impairments, and other handicaps were purposively included to be able to provide reliability and validity information for these groups. Additionally, they documented that the sample was representative with respect to race, sex, residence (urban or rural), family income, educational attainment of parents, and geographic region.

Reliability

Reports of the TOLD-I:3's reliability were available only in the instrument manual.71 Fodness provided test-retest reliability data for the previous version (TOLD-I:2) (Evidence Table 32).70

Internal consistency reliability

Hammill and Newcomer reported Cronbach's coefficient alphas across age groups.71 Alphas for subtest scores ranged from 0.89 to 0.97; for composite scores, alphas ranged from 0.92 to 0.96. They also reported coefficient alphas by sex subgroups, race and ethnic subgroups, and disability subgroups (including children with speech-language problems, learning disabilities, mental retardation, ADHD, and hearing impairment); these alphas ranged from 0.70 to 0.97.

Test-retest or intra-rater reliability

Evidence of test-retest or intra-rater reliability was available for both the TOLD-I:3 and the TOLD-I:2.70,71 Hammill and Newcomer reported correlations for subtests ranging from 0.83 to 0.93; correlations for composites ranged from 0.94 to 0.96, suggesting an acceptable level of test-retest reliability.71 Fodness et al. reported correlations for the TOLD-I:2 ranging from 0.87 to 0.97, with those for older children (12-year-olds) at the lower end of the acceptable range.70

Inter-rater reliability

Hammill and Newcomer reported inter-rater correlations of 0.94 and higher, thus meeting our threshold of acceptable inter-rater reliability.71

Validity

Hammill and Newcomer provided data on content (including information on difficulty/discrimination, and test bias), construct, and concurrent validity for the TOLD-I:3 (Evidence Table 33).71

Construct validity

Hammill and Newcomer provided extensive documentation of construct validity; all of the evidence exceeded our threshold for acceptable evidence of construct validity.71 Additionally, they provided evidence of construct validity for specific disability groups, including speech and language problems, learning disabilities, mental retardation, and ADHD.

First, they observed that mean raw scores increased with each subsequent age group and were correlated with age for each subtest (range = 0.32 to 0.47). Second, mean standard scores for disability subgroups were significantly lower than those for the overall normative group, indicating that the TOLD-I:3 differentiated between groups with differing language abilities. Third, the subtest raw scores were significantly and moderately correlated, with values ranging from 0.38 to 0.63 (median = 0.54). Correlations were similar for demographic subgroups based on sex, race and ethnicity, and disability status, suggesting construct validity for the subgroups. Fourth, they reported that TOLD-I:3 performance correlated significantly with academic achievement in verbal thinking, speech, reading, and mathematics (but not writing), as measured by the Comprehensive Scales of Student Abilities. (However, the manual provides no reference to this scale.) Finally, using a factor analysis, the authors suggested, as hypothesized, that the six TOLD-I:3 subtests represent the domain measuring general spoken language ability (88 percent of total variance explained by a single factor).

Concurrent validity

Hammill and Newcomer provided evidence of the TOLD-I:3's concurrent validity, comparing TOLD-I:3 subtest and composite scores with composite scores from the Test of Adolescent and Adult Language -- Third Edition73 (TOAL-3). They reasoned that concurrent validity for the TOLD-I:3 would be demonstrated if the scores correlated with the subtest and composite scores from the TOAL-3 that measure spoken language. The correlations were high, ranging from 0.58 to 0.86 for TOLD-I:3 subtests and 0.74 to 0.88 for composite scores; the correlation between total scores was 0.85.

Content validity

Hammill and Newcomer comprehensively documented evidence of the TOLD-I:3's content validity, developing a conceptual model, surveying experts in the field, testing item difficulty and discrimination, and employing item response theory to examine item bias.71 In doing so, they provide the most extensive evidence among the instruments we reviewed.

Available Normative Data

Hammill and Newcomer presented normative data for the TOLD-I:3 for children ages 8-0 through 12-11 years (Evidence Table 34).71 They documented that the standardization sample closely matched the US population with respect to geographic region, sex, mother's educational attainment, family income, urban versus rural residence, race, ethnicity, and disability status. The standardization sample included sufficiently large (104 to 201) samples of children in each subgroup for which standardization data were reported.

Applicability of TOLD-I:3 to Target Populations

As described earlier for the TOLD-P:3, Hammill and Newcomer provided evidence of the applicability for children who are hard of hearing but not deaf or who have cognitive deficits, learning disabilities, and speech and language problems.71 They purposively included targeted demographic groups differing in culture, race, and language abilities in the standardization sample. Children with speech-language problems, learning disabilities, mental retardation, or ADHD and children who were hard of hearing were included, and reliability data were provided for them. They specified that the test should not be used with children who are deaf or who do not speak English.

Preschool Language Scale, 3rd Edition (English)

Table 13. Preschool Language Scale, 3rd Edition (PLS-3)
Authors: I.L. Zimmerman, V.G. Steiner, and R.E. Pond
Publisher: The Psychological Corporation, P.O. Box 839954, San Antonio, TX 78283-3954; 1-800-228-0752; http://www.psychcorp.com/
Date of Publication: 1992
Ages: Birth through 6-11 years
Administration Time: 15-30 minutes
Scores: Standard scores, percentile ranks, and age equivalents for Auditory Comprehension, Expressive Communication, and Total Language Scores
Normative Data Provided: By age (13 different age intervals), compared to general population sample
Components: Examiner's manual with picture and reading plates, record forms, toys, and objects.
Procedures: Examiner begins testing on items approximately one year below child's chronological age or expected language age. The basal is then established. Testing continues until ceiling is achieved. Behaviors are elicited as responses to pictures, objects, or verbal stimuli, or a combination. Responses are scored during test administration according to criteria provided in the examiner's manual.
Other: Supplemental measures include Articulation Screener, Language Sample Checklist, and Family Information and Suggestions Form.
Earlier Versions: Preschool Language Scale (1969),115 Preschool Language Scale-Revised Edition (1979)79
The Preschool Language Scale, 3rd Edition (PLS-3) is designed to evaluate the receptive and expressive language skills, as well as precursors of these skills, in infants and young children (Table 13).74 We identified no evidence pertaining to Key Question No. 2 (predictive validity) for this instrument.

Key Question No. 1

Zimmerman et al. standardized the PLS-3 using 1,200 English-speaking children, ages 2 weeks through 6-11 years, in 40 states and the District of Columbia (Evidence Tables 35-38).74 They excluded children previously identified as language disordered or receiving language remediation services following the diagnosis of a language disorder and those who had difficulties at birth, including those who did not go home from the hospital with the mother, who had a hospital stay of more than 1 week, or a significant birth or genetic defect. They documented that the standardization sample closely matched the US population with respect to geographic region, sex, mother's educational attainment, and race or ethnic group.

Reliability

Zimmerman et al. reported data for internal consistency, test-retest, and inter-rater reliability (Evidence Table 36).74 Internal consistency reliability data are reported separately for each age interval for which standard scores are provided.

Internal consistency reliability

Zimmerman et al. reported Cronbach's coefficient alphas of 0.47 to 0.88 for the Auditory Comprehension subscale, 0.68 to 0.90 for the Expressive Communication subscale, and 0.74 to 0.94 for the Total Language score.74 Alphas at the lower ends of these ranges tended to be for children less than 1 year of age, suggesting unacceptable reliability for these groups.

Test-retest or intra-rater reliability

Zimmerman et al. reported test-retest correlation coefficients ranging from 0.89 to 0.90 for the Auditory Comprehension subscale, 0.82 to 0.92 for the Expressive Communication subscale, and 0.91 to 0.94 for the Total Language score.74 Test-retest reliability results are limited in that no children younger than 3 years of age were included; thus, the PLS-3's test-retest reliability for measuring the language performance of younger children is unknown.

Inter-rater reliability

Zimmerman et al. reported inter-rater reliability of 0.98 for the total score; they did not present data for inter-rater reliability for the two subscales.74 Like test-retest reliability discussed earlier, the evidence for the PLS-3's inter-rater reliability was limited to children 3 years of age and older.

Validity

Reports of validity data were available for both the PLS-3 and the previous version of the instrument (Evidence Table 37).74-76

Construct validity

Evidence of construct validity was available for both the PLS-3 and the PLS-R.74-76 Zimmerman et al. examined the correlation between the Auditory Comprehension and the Expressive Communication subscales.74 Across all age ranges, the correlation was 0.64, indicating that the two subscales shared considerable variance but that each also measured something unique. They also compared how well the PLS-3 discriminated between 3-, 4-, and 5-year-old children with identified language disorders and those without. Using a cut-off score of 1.5 standard deviations below the mean to determine the presence of language disorder from the PLS-3 performance, Zimmerman et al. reported the overall level of agreement ranging from 66 percent to 80 percent; EPC staff calculated sensitivity, specificity, and PPV and NPV for each of the three groups.74 Sensitivity ranged from 0.91 to 1.00, specificity from 0.60 to 0.72, PPV from 0.36 to 0.61, and NPV from 0.96 to 1.00. For each age group, the primary type of error was a "false negative" (i.e., children diagnosed as language disordered using the criteria of their respective states but who were not classified as language disordered based on their PLS-3 performance).
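A cut-off expressed in standard deviation units can be translated into a standard score once the scale's mean and standard deviation are known. The Python sketch below assumes a standard-score scale with a mean of 100 and a standard deviation of 15, which is a common convention but is not stated in this report; the function names are hypothetical, and treating only scores strictly below the cut-off as positive is likewise an assumption.

```python
def cutoff_score(scale_mean, scale_sd, sd_units_below):
    """Standard score that lies a given number of SDs below the mean."""
    return scale_mean - sd_units_below * scale_sd


def below_cutoff(score, scale_mean=100.0, scale_sd=15.0, sd_units_below=1.5):
    """True if the score falls below the cut-off (assumed strict inequality)."""
    return score < cutoff_score(scale_mean, scale_sd, sd_units_below)


# With the assumed scale, 1.5 SDs below the mean is a standard score of 77.5.
print(cutoff_score(100.0, 15.0, 1.5))
for score in (70, 77, 85, 100):
    print(score, below_cutoff(score))
```

Moving the cut-off closer to the mean would be expected to classify more children as language disordered, raising sensitivity at the cost of specificity.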

Berryman presented evidence related to the construct validity of the PLS-R for a sample of 672 children, identifying five test items that appeared to have questionable age-level placements and an age grouping that yielded lower passing rates than the next older age group.76 To address these issues, Zimmerman et al. conducted field testing with 451 children, deleting and revising questions and reordering by difficulty within and across age levels.74 Berryman also correlated PLS-R auditory comprehension and verbal ability (r = 0.72); McLoughlin and Gullo reported a correlation of 0.58 between these subscales.75 The difference in results between these two studies may be attributed to the composition of the two study samples.

Concurrent validity

Evidence of concurrent validity was available for both the PLS-3 and the PLS-R.74,75,77

Zimmerman et al. conducted three evaluations of concurrent validity for the PLS-3 (results are reported in the instrument manual).74 They compared scores from the PLS-3 and the Denver II;78 all children were given a "normal" rating based on the Denver II results and fell within 1.5 standard deviations of the mean on the PLS-3. They also compared PLS-3 scores to PLS-R scores for a group of 3-year-olds.79 Correlations were 0.66 for comprehension subscales, 0.86 for expressive subscales, and 0.88 for total scores. Zimmerman et al. also compared the PLS-3 and the CELF-R62 in children aged 5-0 years to 6-11 years. They reported a correlation of 0.69 for comprehension subscales, 0.75 for expressive subscales, and 0.82 for total scores.

McLoughlin and Gullo examined intercorrelations of the PLS-R, the Test of Early Language Development (TELD),80 and the Peabody Picture Vocabulary Test-Revised (PPVT-R).75, 81 The correlation between PPVT-R and PLS-R total scores was 0.73; that between the TELD and the PLS-R total scores was 0.52.

Pecyna Rhyner and Bracken reported moderate correlations between the PLS-R and the Bracken Basic Concept Scale82 (r = 0.40) and between the PLS-R and the Slosson Intelligence Test (SIT)65 (r = 0.35).77 Taken together, these studies suggest that the PLS-3 has acceptable concurrent validity.

Content validity

Zimmerman et al. documented content validity by describing how the instrument samples behaviors related to attention, semantics (vocabulary and concepts), structure (morphology and syntax), integrative thinking skills, vocal development, and social communication.74 They indicated that the content of the PLS-3 test items is based on the literature on typical language development of young children.

Available Normative Data

Zimmerman et al. provided normative data by age group for children from birth through 6-11 years of age (Evidence Table 38).74 The standardization sample, while representative of the US population, included only 48 to 51 children in each of the four age subgroups between birth and 1 year; this may explain some of the lower reliability values for these groups.

Applicability of PLS-3 to Target Populations

Zimmerman et al. provided little guidance on the PLS-3's applicability to the populations of interest to the SSA.74 They indicated that an examiner may alter administration of the PLS-3 for use with children who have severe developmental delays, severe physical impairments, or hearing impairments, but they also indicated that the norms may not be applied to those children. They specifically excluded children with language disorders or birth conditions that might predispose them to cognitive impairments; thus, no data are available on the applicability of the instrument with these populations. As for non-English-speaking children, no translations or language-specific versions are available other than the PLS-3 Spanish Edition, which is described in the next section.

One study evaluated the use of the PLS-3 in Native American children.83 Long found that standard scores for the Native American children did not differ significantly from the US population norms, but 5-year-old children attained significantly lower standardized scores than younger children. From her data, Long concluded that the PLS-3 is appropriate for use with the younger Native American children represented in her sample, but that for 5-year-old Native American children, a performance below the normal range may not indicate a language disorder or delay.

Preschool Language Scale, 3rd Edition (Spanish)

Table 14. Preschool Language Scale-3rd Edition (Spanish) (PLS-3Sp)
Authors: I.L. Zimmerman, V.G. Steiner, and R.E. Pond
Publisher: The Psychological Corporation, 555 Academic Court, San Antonio, TX 78204; Voice 1-800-872-1726; Fax 1-800-232-1223; http://www.psychcorp.com/
Date of Publication: 1993
Ages: Birth through 6-11 years
Administration Time: 20 to 30 minutes
Scores: Standard scores, percentile ranks, and age equivalents for Auditory Comprehension, Expressive Communication, and Total Language Scores (based on standardization sample used for PLS-3)
Normative Data Provided: Scores are determined based on standardization sample for PLS-3 (English version). These are provided by age (13 different age intervals) based on English-speaking U.S. population sample of 1,200 children. Spanish Administration Directions manual includes summary of data from 181 Spanish-speaking children in the United States, comparing ages at which children responded correctly to tasks on PLS-3 English and Spanish editions.
Components: Spanish administration directions, record forms, PLS-3 picture book, PLS-3 examiner's manual.
Procedures: Examiner begins testing on items approximately one year below child's chronological age or expected language age. The basal is established. Testing continues until ceiling is achieved. Behaviors are elicited as responses to pictures, objects, or verbal stimuli, or a combination. Responses are scored during test administration according to criteria provided in the Spanish Administration Directions manual.
Other: None
Earlier Versions: None

The Preschool Language Scale-3, Spanish Edition (PLS-3Sp) is an adaptation of the PLS-3 (Table 14).84 We identified no evidence pertaining to test-retest or intra-rater reliability, concurrent validity, content validity, or predictive validity (Key Question No. 2) for this instrument.

Key Question No. 1

Zimmerman et al. standardized the PLS-3Sp using 181 Spanish-speaking children ages 0-0 to 6-11 years from six states and Puerto Rico (Evidence Tables 39-42).84 Although the children spoke a variety of dialects, the authors indicated that the sample was not representative of Hispanic children in the United States.

Reliability

Zimmerman et al. presented data only for internal consistency reliability (Evidence Table 40).84

Internal consistency reliability

Zimmerman et al. reported Cronbach's coefficient alphas ranging from 0.38 to 0.94 for the Auditory Comprehension subscale, 0.33 to 0.92 for the Expressive Communication subscale, and 0.39 to 0.95 for the Total Language scale.84 The evidence is limited by the small sample sizes (i.e., n = 10 to 30) employed for children of different ages.
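
For readers unfamiliar with the statistic, the following is a minimal sketch of how Cronbach's coefficient alpha is computed from item-level scores, using the standard formula based on item variances and the variance of the total score. The function name and the response data are hypothetical; they are not drawn from the PLS-3Sp manual.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of rows, one row of item scores per child."""
    n_children = len(item_scores)
    n_items = len(item_scores[0])
    if n_children < 2 or n_items < 2:
        raise ValueError("need at least 2 children and 2 items")
    # Sample variance of each item across children.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(n_items)]
    # Sample variance of each child's total score.
    total_var = variance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical pass/fail (1/0) responses of five children to three items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]
alpha = cronbach_alpha(responses)
```

Because every term in the formula is a variance estimated from the sample, alphas computed from subgroups as small as those noted above can vary widely from one sample to the next.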

Validity

Zimmerman et al. provided only construct validity data (Evidence Table 41).84

Construct validity

Zimmerman et al. provided data on the age at which 50 percent, 75 percent, and 90 percent of English-speaking and Spanish-speaking children passed each item on the PLS-3 and PLS-3Sp.74, 84 They concluded that Spanish-speaking children acquired the language skills targeted by the test, but the timing and sequence of skill acquisition appeared to differ somewhat from those of English-speaking children.

Available Normative Data

Normative data from the PLS-3 were used to determine standardized scores for the PLS-3Sp (Evidence Table 42). Zimmerman et al. specified that these norms must be interpreted with caution because they were not derived from representative samples of Spanish-speaking children.84

Applicability of PLS-3Sp to Target Populations

Although the PLS-3Sp was designed for use with Spanish-speaking children, the authors advised caution when interpreting the results, given that they were not derived from a representative sample of Spanish-speaking children. Zimmerman et al. specifically indicated that test subjects must not be deaf; thus, the PLS-3Sp is unlikely to be applicable to Spanish-speaking children with hearing impairments or who are hard of hearing.84 They also provided no guidance on use with children with learning disabilities or mental retardation.

Test of Pragmatic Language

Table 15. Test of Pragmatic Language (TOPL)
Author: D. Phelps-Terasaki and T. Phelps-Gunn
Publisher: Pro-Ed, 8700 Shoal Creek Boulevard, Austin, TX 78757-6897; 512-451-3246; http://www.proedinc.com/
Date of Publication: 1992
Ages: Kindergarten (age 5-0 years) through high school age, and with adult remedial populations, aphasics, and English-as-a-second-language (ESL) populations. Specifically designed to be used with children and adolescents with: learning disabilities, language delays or disorders, reading difficulties; in ESL programs, in family therapy or substance abuse treatment programs; and adults with learning disabilities, aphasia, or in ESL programs.
Administration Time: 30-45 minutes
Scores: Raw scores, percentile ranks, the TOPL quotient, and age equivalents
Normative Data Provided: By age (6-month intervals from 5-0 through 13-11 years) based on examinees of persons purchasing tests from Pro-Ed and individuals who had assisted in the development of other Pro-Ed tests. Normative sample is representative of school-age children.
Components: Examiner's manual, the TOPL Picture Book, Profile/Examiner Record Form.
Procedures: Examiner reads the stimulus prompt (which may or may not require the examiner to show a picture) as written on the Record Form. The examiner scores all responses as 1 (correct) or 0 (incorrect). The Record Form provides examples of scoreable responses and examples of correct and incorrect responses. The total number of correct responses is calculated by summing across all items. This total then is used to obtain the standard score and percentile. The session may be audiotaped and transcribed/scored later.
Other: None
Earlier Versions: None

The Test of Pragmatic Language (TOPL) provides an in-depth screening of the effectiveness and appropriateness of a student's pragmatic or social language skills (Table 15).85 We found no evidence pertaining to test-retest or intra-rater reliability or predictive validity (Key Question No. 2) for this instrument.

Key Question No. 1

Phelps-Terasaki and Phelps-Gunn standardized the TOPL using a sample of 1,096 children from 24 states and one Canadian province (Evidence Tables 44-46).85 They documented that it was representative with respect to gender, residence (urban or rural), race, geographic region, and ethnicity of US children ages 5 through 13.

Reliability

The instrument manual reported data on only internal consistency and inter-rater reliability (Evidence Table 44).85

Internal consistency reliability

Cronbach's coefficient alpha ranged from 0.74 to 0.89 (overall = 0.82) across age groups. Alphas for all groups except children 6 years of age exceeded the "relaxed" threshold.

Inter-rater reliability

Phelps-Terasaki and Phelps-Gunn reported inter-rater reliability of 0.99.85

Validity

Reports of validity data were available in the TOPL manual (Evidence Table 45).85

Construct validity

Phelps-Terasaki and Phelps-Gunn employed principles of item response theory and tested a priori hypotheses based on common understanding of pragmatic language development to provide evidence of construct validity.85 Reported item discrimination coefficients fell within an acceptable range (0.22 to 0.49), as did item difficulty for four of nine age groups, with statistics for other groups exceeding the acceptable range by only a small amount. TOPL scores increased with age as expected, and the instrument can differentiate (i.e., is responsive) between children with and without language disorders. The available evidence suggests acceptable construct validity.

Concurrent validity

Phelps-Terasaki and Phelps-Gunn cited the results of two studies conducted by other researchers as evidence of concurrent validity for the TOPL.85-87 The primary goal of these studies was to evaluate the validity of other instruments, not to validate the TOPL; thus, they were not included in this review. Both studies reported high correlations between the TOPL and other language measures (r = 0.32 [mathematics] to 0.70 [language]86) and general mental ability (r = 0.68).87 They also correlated teacher ratings with TOPL scores (r = 0.82), providing additional good evidence of concurrent validity. The available evidence suggests acceptable concurrent validity.

Content validity

Phelps-Terasaki and Phelps-Gunn based the instrument on a conceptual model drawn from the literature on pragmatic models of language development.85 They employed modern psychometric theory to select items for the instrument.

Available Normative Data

Phelps-Terasaki and Phelps-Gunn provided normative data and documented that the standardization sample closely matched the sex, residence (urban or rural), race and ethnicity, and geographic region of US children (Evidence Table 46).85

Assessment of Intelligibility of Dysarthric Speech

Table 16. Assessment of Intelligibility of Dysarthric Speech (AIDS)
Author: K.M. Yorkston and D.R. Beukelman
Publisher: Pro-Ed, 8700 Shoal Creek Boulevard, Austin, TX 78757-6897; 512-451-3246; http://www.proedinc.com/
Date of Publication: 1984
Ages: Not specified
Administration Time: Not specified
Scores: Percentage of words correct for single word task, speaking rate (words per minute), rate of intelligible speech (number of intelligible words per minute), and communication efficiency ratio (rate of intelligibility of the dysarthric individual compared to that of normal speakers) for sentence intelligibility task
Normative Data Provided: Efficiency ratio reported for comparison with normals
Components: Examiner's manual and picture book. A computer version, which includes a clinician manual and two computer disks, is also available.
Procedures: In the single-word task, the dysarthric individual is audiotaped as he or she produces a series of 50 single words selected randomly from a master word pool. The examiner or listener judges the sample in one of two response formats -- multiple choice or transcription. In the multiple choice format, the listener selects the word that has been spoken from a list of 12 similar sounding words. In the transcription format, the listener writes down the word that has been spoken. In both formats, the results are reported as the percentage correct.
In the Sentence task, the dysarthric individual is audiotaped as he or she produces a series of sentences (5 to 15 words long for a total of 220 words) selected randomly from a master pool of 100 sentences for each sentence length. The judging format is word-by-word transcription. The results are reported in percentage correct.
The speaking rate (words per minute), rate of intelligible speech (number of intelligible words per minute), and communication efficiency ratio (rate of intelligibility of the dysarthric individual compared to that of normal speakers) are calculated from the Sentence Intelligibility task.
Other: None
Earlier Versions: None

The Assessment of Intelligibility of Dysarthric Speech (AIDS) measures intelligibility and speaking rate in children and adults with dysarthria, quantifying single word and sentence intelligibility and speaking rate (Table 16).23
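
The sentence task measures listed in Table 16 are simple ratios, and the sketch below shows one way they could be computed from a transcribed sample. It is an illustration under stated assumptions, not the procedure in the AIDS manual: the function name, the input values, and in particular the reference rate for normal speakers (`normal_intelligible_wpm`) are hypothetical placeholders.

```python
def sentence_task_measures(total_words, intelligible_words,
                           duration_seconds, normal_intelligible_wpm):
    """Intelligibility and rate measures for one sentence-task sample."""
    if total_words <= 0 or duration_seconds <= 0 or normal_intelligible_wpm <= 0:
        raise ValueError("word count, duration, and reference rate must be positive")
    if not 0 <= intelligible_words <= total_words:
        raise ValueError("intelligible words must be between 0 and total words")
    minutes = duration_seconds / 60
    speaking_rate = total_words / minutes            # words per minute
    intelligible_rate = intelligible_words / minutes  # intelligible words per minute
    return {
        "percent_intelligible": 100 * intelligible_words / total_words,
        "speaking_rate_wpm": speaking_rate,
        "intelligible_wpm": intelligible_rate,
        # Speaker's intelligible rate relative to the assumed normal reference rate.
        "efficiency_ratio": intelligible_rate / normal_intelligible_wpm,
    }

# Hypothetical sample: 220 words spoken in 2 minutes, 165 transcribed correctly.
# The reference rate of 165.0 is a placeholder, not a value from the manual.
measures = sentence_task_measures(220, 165, 120, normal_intelligible_wpm=165.0)
```

Keeping the reference rate as an explicit argument makes clear that the efficiency ratio depends on which comparison group of normal speakers is used.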

Key Question No. 1

Yorkston and Beukelman standardized the AIDS using a small sample (n = 12) of dysarthric speakers (Evidence Tables 47-49); they did not provide demographic information about the subjects.23 We identified no evidence on internal consistency, construct validity, concurrent validity or predictive validity (Key Question No. 2).

Reliability

Yorkston and Beukelman provided data on only the intra- and inter-rater reliability of the AIDS (Evidence Table 48).23

Test-retest or intra-rater reliability

Yorkston and Beukelman reported correlations of 0.90 and 0.87 for the multiple choice and transcription formats of the single-word task, respectively.23 They also reported correlations of 0.96 to 0.99 for intelligibility percent and 0.99 for rate of intelligible speech. With the exception of the single-word transcription format, the AIDS met our threshold for acceptable test-retest or intra-rater reliability.

Inter-rater reliability

Yorkston and Beukelman reported correlations for single-word intelligibility ranging from 0.88 to 0.99, depending upon the scoring method; for sentences, correlations ranged from 0.93 to 0.99 for intelligibility and 0.99 for rate of intelligible speech.23 As with the evidence for test-retest or intra-rater reliability, only the single-word transcription format did not meet our threshold for acceptable inter-rater reliability.

Validity

Yorkston and Beukelman reported only content validity data (Evidence Table 49). No data were reported for either construct or concurrent validity.23

Content validity

Yorkston and Beukelman argued that measurement of intelligibility of dysarthric speech has face validity: "the more understandable the speaker, the better able he is to function in communicative settings."23 They stated that intelligibility measures are easily communicated to family members and members of the rehabilitation team and can be easily understood.

Available Normative Data

Normative data per se were not available for the AIDS (Evidence Table 50). Instead, Yorkston and Beukelman provided an efficiency ratio (for comparison with "normal speech") derived from 20 subjects whose characteristics were not reported.23

Dysarthria Examination Battery

Table 17. Dysarthria Examination Battery (DEB)
Authors: S.S. Drummond
Publisher: Communication Skill Builders, The Psychological Corporation, 555 Academic Court, San Antonio, TX 78204; Voice 1-800-872-1726; Fax 1-800-232-1223; http://www.psychcorp.com
Date of Publication: 1993
Ages: Adults and children (ages not specified; none of the validation studies was performed with children)
Administration Time: Approximately 1 hour
Scores: Provides thresholds for abnormality for each of the 21 quantitative tasks and 15 rating scale tasks (Table 3.3). Thresholds for abnormality for the quantitatively scored tasks are based on the literature and on data on administration to 62 dysarthric subjects. For rating scale items, responses of "2" through "5" are considered abnormal. The total number of tasks scored as abnormal is used to determine the severity and recommended treatment for dysarthria (Table 3.4). Percent measures also provided to determine severity/recommended treatment in the event that fewer than 36 tasks are administered.
Normative Data Provided: None
Components: Instructions for administration, examiner's manual, stimulus cards, scoring form. Requires the use of a stopwatch, audiotape recorder, dry spirometer, flashlight, tongue depressor, laryngeal mirror, bite block, swab, and speech analysis equipment and software.
Procedures: The examiner administers the 23 tasks as described in the instrument manual, using any additional equipment required. The examiner records values for quantitative tasks and compares them to the norms described in the manual. Rating scale tasks are rated on a scale from 1 (normal) to 5 (severe abnormality according to task). The total number of abnormal performances on the 23 tasks is cumulated and compared to Table 3.4 in the manual to determine severity and for treatment recommendations. Rating scale tasks may be audiotaped for later scoring.
Other: None
Earlier Versions: Three unpublished versions

The Dysarthria Examination Battery (DEB) evaluates dysarthria in adults and children (Table 17).88 Developed and implemented for routine clinical use at the University of Arkansas Medical Center, the DEB takes an anatomic-physiological approach to evaluating the function of five speech processes -- respiration, phonation, resonation, articulation, and prosody. We identified no evidence on internal consistency reliability, concurrent validity, content validity, normative data, or predictive validity (Key Question No. 2).
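
The scoring rule summarized in Table 17 amounts to counting abnormal task results and, when not every task is given, expressing that count as a percentage of the tasks administered. The sketch below illustrates that tally under stated assumptions: the function name is hypothetical, and because the manual's thresholds for quantitative tasks and its severity cut-offs (Tables 3.3 and 3.4 of the manual) are not reproduced in this report, quantitative results are passed in as flags already judged by the examiner and no severity label is assigned.

```python
def deb_abnormal_summary(rating_scores, quantitative_abnormal):
    """Tally abnormal DEB task results.

    rating_scores: ratings from 1 (normal) to 5 for administered rating scale tasks.
    quantitative_abnormal: booleans, True where the examiner judged a quantitative
        task abnormal against the manual's threshold (thresholds not encoded here).
    """
    for score in rating_scores:
        if score not in (1, 2, 3, 4, 5):
            raise ValueError("rating scale scores must be integers from 1 to 5")
    administered = len(rating_scores) + len(quantitative_abnormal)
    if administered == 0:
        raise ValueError("at least one task must be administered")
    # Ratings of 2 through 5 count as abnormal.
    abnormal = sum(1 for score in rating_scores if score >= 2)
    abnormal += sum(1 for flag in quantitative_abnormal if flag)
    return {
        "abnormal_tasks": abnormal,
        "administered_tasks": administered,
        "percent_abnormal": 100 * abnormal / administered,
    }

# Hypothetical partial administration: four rating tasks, four quantitative tasks.
summary = deb_abnormal_summary([1, 2, 5, 1], [True, False, False, True])
```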

Key Question No. 1

Drummond standardized the DEB using 20 men and women who were either "mildly" dysarthric or normal (nondysarthric and not brain-damaged) and treated in a university medical center clinic (Evidence Tables 51-53).88 Additionally, she used data from 34 or 38 individuals described in three other studies to assess reliability and validity (determining the actual number of individuals with the information given in the other three studies is difficult). No information was provided with which to assess the generalizability of results to other dysarthric individuals.

Reliability

Reliability data are available only in the instrument manual; the manual briefly describes data presented at a conference, as a thesis, and in several studies, which all had been excluded from review based on small sample sizes. Reliability data (Evidence Table 52) are reported for intra- and inter-rater reliability only.

Test-retest or intra-rater reliability

Drummond reported mean correlations (across the measures and levels of examiner experience) ranging from 0.67 to 0.81.88 Inexperienced examiners had lower intra-rater reliability, with correlations ranging from 0.67 to 0.69. The evidence did not meet our threshold for acceptable test-retest or intra-rater reliability.

Inter-rater reliability

Drummond reported mean correlations ranging from 0.61 (nasality) to 0.98 (lingual lateralization), with a mean correlation of 0.90.88 Correlations generally were lower for inexperienced raters. This evidence did not meet our threshold for acceptable inter-rater reliability.

Validity

Drummond reported construct validity data only (Evidence Table 53).88

Construct validity

Drummond combined the results from her standardization sample with data from three studies described very briefly in the manual; little information was provided on the design of these studies and none was published in the peer-reviewed literature. She offered no statistical data to support her conclusion that the instrument has construct validity.

Stuttering Severity Instrument for Children and Adults, 3rd Edition

Table 18. Stuttering Severity Instrument, 3rd Edition (SSI-3)
Author: G.D. Riley
Publisher: Pro-Ed, 8700 Shoal Creek Boulevard, Austin, TX 78757-6897; 512-451-3246; http://www.proedinc.com/
Date of Publication: 1994
Ages: 2 years, 10 months through adult
Administration Time: Not specified
Scores: Percentile ranks
Normative Data Provided: By age (three different age intervals) based on clinical sample of stutterers
Components: Examiner's manual with picture and reading plates, record forms.
Procedures: Examiner elicits at least two speech samples from individual consisting of at least 200 syllables per sample. One sample is recorded at home if possible. For individuals with reading level at 3rd grade or above, one sample is elicited via reading. Samples are audiotaped or videotaped for later analysis. Examiner computes percentage of syllables stuttered and mean duration of three longest stuttering episodes. In addition, examiner rates physical concomitants of stuttering.
Other: Adjectives describing stuttering severity with different percentile scores
Earlier Versions: Stuttering Severity Instrument for Children and Adults (1972);116 Stuttering Severity Instrument for Children and Adults, Revised (1981)91

The Stuttering Severity Instrument for Children and Adults, 3rd Edition (SSI-3) provides an objective description of stuttering severity in children and adults (Table 18).89 We identified no evidence for this instrument on internal consistency reliability, content validity or predictive validity (Key Question No. 2).
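
Two of the quantities the examiner computes (Table 18) are straightforward to express in code: the percentage of syllables stuttered and the mean duration of the three longest stuttering episodes. The sketch below is illustrative only. The function names and sample values are hypothetical, averaging over fewer than three episodes when fewer are observed is an assumption, and the conversion of these values to SSI-3 percentile ranks is not shown because the manual's conversion tables are not reproduced in this report.

```python
def percent_syllables_stuttered(stuttered_syllables, total_syllables):
    """Percentage of syllables in a speech sample that were stuttered."""
    if total_syllables <= 0:
        raise ValueError("total syllables must be positive")
    if not 0 <= stuttered_syllables <= total_syllables:
        raise ValueError("stuttered syllables must be between 0 and the total")
    return 100 * stuttered_syllables / total_syllables

def mean_of_three_longest(durations_seconds):
    """Mean duration of the three longest stuttering episodes.

    Assumption: if fewer than three episodes were timed, average those available.
    """
    if not durations_seconds:
        raise ValueError("at least one timed episode is required")
    longest = sorted(durations_seconds, reverse=True)[:3]
    return sum(longest) / len(longest)

# Hypothetical sample: 18 stuttered syllables out of 240, five timed episodes.
frequency = percent_syllables_stuttered(18, 240)
duration = mean_of_three_longest([0.5, 2.0, 1.0, 3.0, 0.8])
```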

Key Question No. 1

Riley standardized the SSI-3 using 271 children and adults (ages 2 to 17+) from California (Evidence Tables 54-57).89 Subjects under age 8 had no prior treatment for stuttering whereas most of those older than 8 had received treatment. He made no attempt to assess whether the sample was representative of US children and adults who stutter. Thus, the results he reported may not be generalizable to individuals with characteristics different from the standardization sample.

Reliability

Reports of reliability are available only in the instrument manual (Evidence Table 55).89

Test-retest or intra-rater reliability

Riley evaluated intra-rater reliability, reporting only percentage agreement ranging from 72 percent to 96 percent depending upon the parameter measured and the group of judges.89 Thus, the SSI-3 did not meet our threshold for acceptable test-retest or intra-rater reliability.

Inter-rater reliability

Riley evaluated inter-rater reliability, reporting percentage agreement between raters that ranged from 81 percent to 96 percent depending upon the parameter measured and the group of judges.89 These data did not meet our criterion for acceptable inter-rater reliability.

Validity

Riley reported data on the SSI-3's construct and concurrent validity (Evidence Table 56).89

Construct validity

Riley tested hypotheses that duration and physical concomitant scores would increase with age and that frequency would decrease with age.89 The correlations (all significant at p < 0.01) behaved as hypothesized; duration correlations increased from 0.73 to 0.77 and physical concomitant score correlations increased from 0.68 to 0.77. Frequency score correlations declined from 0.83 to 0.74. This evidence met our criterion for acceptable construct validity.

Concurrent validity

Riley provided indirect evidence of concurrent validity by comparing the correlations between the SSI-3 total overall score and the SSI-3 frequency score (r = 0.74 to 0.83) with the correlation reported by Yaruss and Conture (1992) (r = 0.072), who compared the SSI-Revised Edition to the Stuttering Prediction Instrument.89-91

Available Normative Data

Riley reported limited normative data for the SSI-3 (Evidence Table 57). These norms were derived from a relatively small, geographically similar sample, and Riley made no comparison between the standardization sample and the stuttering population in the United States. Thus, the results from the standardization and the resulting norms may not be generalizable to US children and adults who stutter.89

Goldman-Fristoe Test of Articulation, 2nd Edition

Table 19. Goldman-Fristoe Test of Articulation, 2nd Edition (GFTA-2)
Author: R. Goldman and M. Fristoe
Publisher: American Guidance Services, Inc., 4201 Woodland Road, Circle Pines, MN 55014-1796; 800-328-2560; http://www.agsnet.com
Date of Publication: 2000
Ages: Ages 2-0 through 21-11 years
Administration Time: 5-15 minutes
Scores: Standard scores with 90 percent and 95 percent confidence intervals, percentile ranks, and test-age equivalents
Normative Data Provided: By age (1-month intervals from ages 2-0 through 21-11 years) and gender based on a large national sample of children for Sounds-in-Words subtest only. Normative sample is representative of gender, race/ethnicity, geographic region, socioeconomic status, and special education.
Components: Examiner's manual, easel with 43 GFTA-2 picture plates, and response form.
Procedures: During the Sounds-in-Words subtest, the examiner elicits sounds as part of words using the picture plates and the question "what is it?" Examiner records whether the child produced the sound correctly, incorrectly, or did not produce the sound at all (Level 1 testing); or child substituted another sound, omitted the sound, distorted the sound, or added a sound (Level 2 testing). The raw score is the total number of articulation errors. Raw scores are converted to standard scores with 90 percent and 95 percent confidence intervals, percentile ranks, and test-age equivalents using tables in the manual.
During the Sounds-in-Sentences subtest, the examiner reads aloud two picture-based stories. The examinee then retells each story using the picture plates, which illustrate the gist of the story and target words. The recording procedure is similar to that for the Sounds-in-Words subtest and the results are to be compared to those from the Sounds-in-Words subtest. No score is calculated.
During the Stimulability subtest, the examiner uses a set of picture plates in the easel and asks the examinee to watch the examiner's mouth and to listen carefully as the examiner says a syllable, word, or sentence (in that order). The examiner tests only those sounds that were misarticulated during the Sounds-in-Words and/or Sounds-in-Sentences subtests and only in the position in which the misarticulation occurred.
Other: None
Earlier Versions: None

The Goldman-Fristoe Test of Articulation, 2nd Edition (GFTA-2) evaluates an individual's articulation of the consonant sounds of Standard American English (Table 19).92 Applications include screening, assessment of the severity of articulation deficit, evaluation of error patterns, and monitoring of growth or progress.

Key Question No. 1

Goldman and Fristoe standardized the GFTA-2 using 2,350 children and adolescents ages 2 through 21 years (Evidence Tables 58-62).92 This sample closely matches the US population with respect to sex, race or ethnic group, geographic region, and mother's education. Additionally, they included representative proportions of students enrolled in special education under different categories of eligibility.

Reliability

Reports of reliability are available for the GFTA-2 and the original version of the instrument (Evidence Table 59).92, 93 Goldman and Fristoe reported reliability data only for the GFTA-2 Sounds-in-Words subtest.

Internal consistency reliability

Cronbach's coefficient alphas ranged from 0.92 to 0.98 (median = 0.96) for females and 0.85 to 0.96 (median = 0.94) for males. With the exception of one or two age groups for boys, the alphas met our "strict" criterion for internal consistency reliability.

Test-retest or intra-rater reliability

Goldman and Fristoe reported percentage agreement (rather than Cohen's Kappa or correlations) for the presence of errors in children age 4 through 7 years.92 The median agreement was 98 percent, ranging from 89 percent to 100 percent for initial position, 79 percent to 100 percent for medial position, and 91 percent to 100 percent for final position. Because they did not use appropriate statistics to measure test-retest reliability, the evidence did not meet our criterion.
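
The distinction between percentage agreement and Cohen's kappa matters most when one judgment (for example, "no error") is far more common than the other, because two sets of judgments will then agree often by chance alone. The following minimal sketch computes both statistics for the same pair of judgment lists; the function names and the data are hypothetical and are not taken from the GFTA-2 manual.

```python
from collections import Counter

def percent_agreement(first, second):
    """Percentage of paired judgments that are identical."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    return 100 * sum(a == b for a, b in zip(first, second)) / len(first)

def cohens_kappa(first, second):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first, counts_second = Counter(first), Counter(second)
    # Chance agreement from the marginal proportion of each category.
    expected = sum(counts_first[c] * counts_second[c]
                   for c in set(first) | set(second)) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Hypothetical judgments on 20 sounds at two sessions; errors are rare.
session_1 = ["ok"] * 17 + ["error"] + ["ok", "error"]
session_2 = ["ok"] * 17 + ["error"] + ["error", "ok"]
agreement = percent_agreement(session_1, session_2)
kappa = cohens_kappa(session_1, session_2)
```

In this made-up example the two sessions agree on 90 percent of sounds, yet kappa is only about 0.44, which illustrates why a high percentage agreement by itself may overstate reliability.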

Inter-rater reliability

As reported earlier for test-retest reliability, Goldman and Fristoe reported inter-rater reliability as percentage agreement.92 The median percentage agreement was 93 percent (range = 63 percent to 100 percent) on initial sounds, 90 percent (range = 73 percent to 100 percent) on medial sounds, and 90 percent (range = 73 percent to 100 percent) on final sounds.

Two studies reported inter-rater reliability for the original GFTA.72, 93 Seymour and Seymour reported inter-rater reliability for black and white children in Head Start programs; inter-rater reliability surpassed 0.90, but they did not indicate what statistic was used to report reliability (i.e., whether the statistic reflected percentage of agreement or inter-judge correlation).93 Lewis et al. reported inter-rater reliability for children with moderate to severe speech disorders who were enrolled in speech-language therapy; they reported a mean percentage agreement of more than 95 percent for the transcribed responses.72

None of the studies identified employed appropriate statistics to measure inter-rater reliability; thus, the evidence did not meet our criterion for inter-rater reliability.

Validity

Reports of validity data are available for both the GFTA-292 and the GFTA94 (Evidence Table 60). Goldman and Fristoe reported construct, concurrent, and content validity data. Botting and colleagues and Mullen and Whitehead reported concurrent validity data for the original instrument.92,95,96

Construct validity

Goldman and Fristoe demonstrated that raw scores decrease regularly with age. This pattern is consistent with what is known about typical articulation development and closely matches other published reports.92

Concurrent validity

Two studies reported concurrent validity data for the original GFTA.95, 96 Mullen and Whitehead compared correct initial identification of stimulus pictures for the original GFTA and the Arizona Articulation Proficiency Scale (AAPS)97; they reported that children made significantly more errors in their initial identification of the pictures in the original GFTA than the AAPS.96 Goldman and Fristoe dropped 7 of these 20 error-prone stimulus words and revised the artwork to address these errors.92

Botting et al. compared teacher judgements of disability to GFTA scores in 242 British children with language impairment.95 They reported concordance between GFTA scores and teacher opinion; EPC staff calculated the sensitivity, specificity, and PPV and NPV using Botting et al.'s data, using the 25th percentile as a threshold. Sensitivity ranged from 0.67 to 0.74, specificity from 0.77 to 0.85, PPV from 0.59 to 0.79, and NPV from 0.75 to 0.87 depending on the nature of the disorder. These results suggest that concordance between the GFTA and teacher opinion was high.
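
For reference, the four accuracy measures named above are all derived from a two-by-two table that cross-classifies the test result against a reference judgment. The sketch below assumes that teacher opinion serves as the reference and that a score below the threshold counts as a positive test; the function name and the counts are hypothetical and are not Botting et al.'s data.

```python
def diagnostic_accuracy(true_pos, false_pos, false_neg, true_neg):
    """Sensitivity, specificity, PPV, and NPV from a two-by-two table.

    true_pos:  test positive, reference (teacher) judges disability present
    false_pos: test positive, reference judges disability absent
    false_neg: test negative, reference judges disability present
    true_neg:  test negative, reference judges disability absent
    """
    counts = (true_pos, false_pos, false_neg, true_neg)
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    if (true_pos + false_neg == 0 or true_neg + false_pos == 0
            or true_pos + false_pos == 0 or true_neg + false_neg == 0):
        raise ValueError("each row and column of the table must be non-empty")
    return {
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
    }

# Hypothetical counts for 100 children.
accuracy = diagnostic_accuracy(true_pos=30, false_pos=10, false_neg=10, true_neg=50)
```

Sensitivity and specificity describe the test relative to the reference judgment, whereas PPV and NPV also depend on how common the disorder is in the sample, so values from a clinical sample may not carry over to other populations.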

Content validity

Goldman and Fristoe documented that the GFTA-2 assesses 23 of the 25 recognized consonant sounds and 16 consonant clusters found in Standard American English.92

Available Normative Data

Goldman and Fristoe provided normative data by age and sex for the Sounds-in-Words subtest only (Evidence Table 61).92 They documented that the standardization sample closely matched the US population with respect to sex, race or ethnic group, geographic region, and mother's educational attainment. The standardization sample included sufficiently large (i.e., 100-125) samples of individuals of each sex in each age subgroup for which standardized scores are reported.

Applicability of GFTA-2 to Target Populations

Two studies examined whether scores from the GFTA-2 and the original GFTA differed for different groups of children.92, 93 Goldman and Fristoe reported little difference between Canadian and Standard American English.92 Seymour and Seymour concluded that black children can be successfully tested with the GFTA.93

Although neither of these populations exactly matches the groups of interest to the SSA, the results provide important insights. Goldman and Fristoe included children with mental retardation, speech/language impairments, learning disabilities, and emotional disturbances; reliability, validity and normative data were not provided separately for these groups.92 They also cautioned users that the age-based norms may not be useful with mentally retarded children and adults. Goldman and Fristoe provided no guidance about use with hard-of-hearing children; Lewis et al. required children to have normal hearing acuity.92

Key Question No. 2

We identified no literature describing the predictive validity of the GFTA-2. Lewis et al. reported data on the predictive validity of the original instrument (Evidence Table 62).72 They found that among children identified with moderate to severe speech sound disorders, preschool GFTA scores significantly predicted school-age reading scores but not language or spelling scores.

GRBAS (Grade, Rough, Breathy, Asthenic, Strain) Scale

Table 20. GRBAS (Grade, Rough, Breathy, Asthenic, Strain) Scale
Author: Japan Society of Logopaedics and Phoniatrics98
Publisher: Not specified
Date of Publication: 1981 (in English-language literature)
Ages: Not specified
Administration Time: Not specified
Scores: Vocal quality graded on 4-point scale -- 0 (normal), 1 (slight), 2 (moderate), and 3 (severe) -- for each of five voice parameters: G (overall grade of hoarseness), R (roughness), B (breathiness), A (asthenic), and S (strained quality).
Normative Data Provided: None -- Standardized tape with typical voice samples available from Japan Society of Logopaedics and Phoniatrics
Components: GRBAS Scale, high-quality recording device, standardized tape of typical voice samples available from Japan Society of Logopaedics and Phoniatrics (we cannot determine whether this is readily available)
Procedures: Voice samples are recorded. Content of voice sample may vary, ranging from spontaneous speech recorded during the session to specific vowel sounds and phonetically balanced samples of various lengths. The listener grades the five voice parameters using a 4-point scale.
Other: Different scales ranging from 10-point to visual analog scales have been employed.
Earlier Versions: None

The GRBAS Scale provides a system for describing vocal quality as measured by five parameters: G (overall grade of hoarseness), R (roughness), B (breathiness), A (asthenic), and S (strained quality) (Table 20 and Evidence Tables 63-65). This scale is not used widely in the United States but it does appear in the US literature and is used in some clinical settings. We found no data on predictive validity (Key Question No. 2) for this instrument.

Key Question No. 1

We were unable to obtain a copy of the original GRBAS Scale and associated documentation. Hirano provided the first English-language description of the scale but provided no information on either reliability or validity.98

Reliability

Reports of reliability were found in seven peer-reviewed articles (Evidence Table 64).99-105 Reliability data were not reported by age, sex, or race; none of the studies provided information on voice disorders in children. All studies were conducted in Europe.

Internal consistency reliability

Two studies examined the internal consistency reliability of the GRBAS Scale.100, 101 However, they did not provide a clear picture of internal consistency reliability because they employed different forms of the GRBAS Scale; de Krom used the original GRBAS Scale and Dejonckere et al. a visual analog scale version.100, 101 Dejonckere et al. also combined two of the parameters, unlike the original version. de Krom failed to report data for two parameters.

Test-retest or intra-rater reliability

Several studies examined test-retest reliability99 and intra-rater reliability100-102,105 of the GRBAS Scale. These studies presented data on the original scale as well as variants, including visual analog scale and ordinal scale formats. They employed different, sometimes nonstandard, statistical methods to assess reliability. None met our threshold for acceptable test-retest or intra-rater reliability (i.e., Cohen's kappas of at least 0.80 or correlations of at least 0.90).
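
To make the kappa threshold concrete, the following minimal sketch computes unweighted Cohen's kappa for two sets of GRBAS-style ordinal ratings (0 to 3) of the same voice samples. The ratings are hypothetical, and the reviewed studies may have used weighted or otherwise different variants of the statistic; this is an illustration of the basic formula, not a reconstruction of any study's analysis.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Proportion of samples on which the two ratings agree exactly.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rating set's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten voice samples on a 0-3 scale, rated twice.
first_pass = [0, 1, 2, 3, 1, 2, 0, 3, 1, 2]
second_pass = [0, 1, 2, 3, 2, 2, 0, 3, 1, 1]
kappa = cohens_kappa(first_pass, second_pass)
meets_threshold = kappa >= 0.80
```

In this hypothetical case the two passes agree on 8 of 10 samples, yet kappa falls below 0.80 because a sizable share of that agreement would be expected by chance alone.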

De Bodt et al. reported kappas of 0.40 to 0.45 for the original GRBAS Scale.98, 99 Dejonckere et al. reported intra-rater correlations ranging from 0.68 to 0.89.102 de Krom reported data in graphical displays rather than reporting correlation coefficients or kappas.101 Langeveld et al. reported intraclass correlations ranging from 0.36 to 0.77.105

Inter-rater reliability

Dejonckere et al. (in two studies), De Bodt et al., de Krom, Millet and Dejonckere, Wuyts et al., and Langeveld et al. all examined inter-rater reliability.99-104 Like the results reported earlier for test-retest and intra-rater reliability, these studies employed different GRBAS Scale versions and applied different, sometimes nonstandard, statistical methods to assess reliability. None met our threshold for acceptable inter-rater reliability (i.e., kappas of at least 0.80 or correlations of at least 0.90).

Validity

We identified several studies evaluating the construct and concurrent validity of the GRBAS Scale (Evidence Table 65).102-105

Construct validity

No data on construct validity were reported in the original English-language introduction of the GRBAS Scale.98 Langeveld et al. presented correlations between the five parameters; the B, R, A, and S parameters were uncorrelated (r = −0.01 to 0.23) and overall grade (i.e., G) was correlated only with S (r = 0.74).105 They concluded that these perceptual characteristics were independent parameters, thus measuring different aspects of voice quality.

Concurrent validity

Two studies compared the GRBAS to acoustical measures or compared the different formats of the GRBAS Scale (ordinal scale versus visual analog scale).102, 104 Although these studies appeared to meet our concurrent validity criterion, they used comparison instruments that have unacceptable reliability, which undermines their claim to validity.

Dejonckere et al. compared the GRBAS Scale to the Kay Elemetrics Multi-Dimensional Voice Program (MDVP).102 They reported that correlations between the GRBAS Scale and MDVP parameters ranged from 0.45 to 0.73. However, their results suggested unacceptable inter-rater reliability (Cohen's kappa = 0.41) and intra-rater reliability (value not reported) for the GRBAS. Wuyts et al. compared the ordinal GRBAS with a visual analog version.104 Although correlations were high (ranging from 0.93 to 0.97), neither version had acceptable inter-rater reliability.

Content validity

Several studies discussed the difficulty of measuring perceptual data and comparing them to other types of data, but none addressed the comparison of perceptual ratings for a vowel extraction sample with those for a sample of normal conversation.

Available Normative Data

No normative data were provided for the GRBAS Scale. A standardized tape with typical voice samples is available from the Japan Society of Logopaedics and Phoniatrics. We were unable to determine whether this tape is readily available in the United States.

Multi-Dimensional Voice Program

Table 21. Multi-Dimensional Voice Program (MDVP)
Authors: Kay Elemetrics Corporation
Publisher: Kay Elemetrics Corporation, 2 Bridgewater Lane, Lincoln Park, NJ 07035-1499 USA; Voice 1-800-289-5297; Fax 1-973-628-6363; http://www.kayelemetrics.com
Date of Publication: 1999
Ages: Not specified (adults used in validation)
Administration Time: Total administration time not specified; 16 seconds for the acoustical analyses only.
Scores: 33 acoustic parameters computed: average fundamental frequency, phonatory frequency range, several frequency and amplitude short- and long-term perturbation and variation measures, noise-to-harmonic ratio, voice turbulence and soft phonation indexes, quantitative measures of voice breaks, subharmonic components and voice tremors.
Normative Data Provided: Performance on all acoustic subtests compared to subjects in two descriptive appendices. Study One, the intersystem reliability study, compared 21 male and 21 female normal adult speakers with 21 male and 21 female patients with a voice disorder. Study Two evaluated 68 subjects with normal and disordered voices. A CD-ROM includes patient information, MDVP analysis, and signal profiles for 700 disordered voices.
Components: MDVP is an optional software program purchased with Multi-Speech or CSL computer hardware and software. The CSL and Multi-Speech hardware and software are not described in the MDVP manual (they have separate manuals), but the "generic" sound card is mentioned as being inferior to the "professional level" sound input/output systems of CSL.
Procedures: The sample data extraction method recommends that the subject say a sustained "ah" at a flat tone for at least 3 seconds. Later in the manual the Recommendations from the National Center for Speech and Voice are mentioned, stating that more than one "token" (trial) should be used for acoustic analyses. The "Tips" section recommends that the examiner elicit sustained voice and a reading passage. After a sound is elicited and recorded, the examiner saves the acoustic signal with Windows toolbars or computer function keys. Extracted data are displayed as numerical data for each of the acoustic parameters. Additionally, seven graphic displays are presented. Six present data on captured signal; average fundamental frequency (Fo) and peak-to-peak amplitude; long-term average spectrum; waveform (window A); Fo and amplitude modulation components; and histogram of Fo and amplitude displays (windows E and F). For 19 of the 33 parameters, a radial graph displays the patient's parameters (window D). The display is scaled so that normative thresholds for the parameters circumscribe a green circle. Patient parameters falling outside the normative threshold are indicated in red. The radial graph also can be used to compare the patient's parameters to the mean of a group of normal voice samples and against a threshold of 1 SD above the normal value.
Other: A model of pathologic voice production is presented in the Appendix.
Earlier Versions: MDVP Model 4305, a DOS program used with earlier versions of the CSL.

The Multi-Dimensional Voice Program (MDVP), Model 5105 is a software tool that provides clinicians and researchers an automatic speech analysis program with a multi-dimensional analysis of voice (Table 21 and Evidence Tables 66-69).106

Key Question No. 1

Reliability

Reliability data for the MDVP, Model 5105 system were available from studies presented in the instrument manual.106, 107 One peer-reviewed article described reliability data for the previous DOS-based MDVP (Evidence Table 67).108

Internal consistency reliability

No internal consistency data were reported for the MDVP because the MDVP does not yield one score or one result. The 33 acoustic parameters are considered separately or in the independent groupings of pitch, noise, and tremor analyses or as frequency perturbation, amplitude perturbation, voice break/irregularity, and noise/tremor analyses. These are not compared to each other because different voice disorders would affect the score of each independently.

Test-retest or intra-rater reliability

Kent et al. reported results of repeated analysis of the same sample (i.e., intra-rater reliability) using the DOS-based MDVP; the system was able to compute analyses for all voice samples and replicate analyses were virtually identical for all but one voice.108 Analyses of different samples from the same voice (test-retest) and for voice samples taken at two different times during the same recording session yielded relatively small discrepancies in parameters and low frequency of discrepancy.

Inter-rater reliability

Inter-rater reliability does not apply to the MDVP.

Validity

Two studies102, 109 and a study reported within the MDVP manual110 described efforts at validating the MDVP (Evidence Table 68). Dejonckere et al.102 and van As et al.109 addressed both construct and concurrent validity and Deliyski110 offered evidence of content validity.

Construct validity

Dejonckere et al. and van As et al. employed principal components analysis to derive evidence of construct validity. Both studies were conducted in Europe using voice samples from patients seen in voice clinics; van As et al. also analyzed "normal" voices.102, 109 Both studies reported that the parameters "statistically" loaded on the same factors. Unlike Dejonckere et al., van As et al. reported demographic characteristics, although they evaluated only men's voices.102, 109 (The MDVP program can be set to measure female voices, giving different results.)

Concurrent validity

Deliyski and Gress compared the previous DOS version with the current Windows version; they called their analysis inter-system reliability, although it really provides evidence of concurrent validity.107 They reported correlation coefficients of greater than 0.99 for all measured parameters.

Two studies compared the MDVP to perceptual measures of voice quality. However, both studies suffered from important methodological deficits. Dejonckere et al. correlated MDVP parameters and the GRBAS Scale; they reported that the inter-rater reliability (Cohen's kappa = 0.41 to 0.51) and intra-rater reliability (value not reported) of the GRBAS were low.102 van As et al. compared MDVP parameters to an undescribed set of "seven semantic scales" describing voice in a perceptual task, providing no evidence of the reliability and validity of the perceptual rating scale.109

Thus, evidence of concurrent validity was excellent when the MDVP was compared with its previous version, but no such evidence exists for comparisons with perceptual rating scales.

Content validity

Deliyski presented an acoustic model of disordered voice production and used it to justify the 33 computed acoustic parameters and the subsequent grouping of the parameters into pitch, noise, and tremor analyses.110

Available Normative Data

Deliyski presented normative values for the MDVP parameters (Evidence Table 69).110 However, he cautioned MDVP users that normative values may depend upon the individuals included in the study, and thus they may be generalizable only to individuals similar to those tested in the study.

Applicability of MDVP to Target Populations

We identified no studies that address the applicability of the MDVP to populations of interest to the SSA. Of particular importance is that none of the studies or the test manual reported analyses of children's voices; thus, no data are available on the applicability of the MDVP to children with voice disorders.106

Voice Handicap Index

Table 22. Voice Handicap Index (VHI)
Author: B.H. Jacobson, A. Johnson, C. Grywalski, A. Silbergleit, G. Jacobson, M.S. Benninger, C.W. Newman
Publisher: B.H. Jacobson, Division of Speech-Language Sciences and Disorders, Department of Neurology, Henry Ford Hospital, 2799 W. Grand Boulevard, Detroit, MI 48202
Date of Publication: 1997
Ages: Not specified (adults used in validation)
Administration Time: Not specified
Scores: Three 10-item subscales (functional, emotional, and physical) with scores from 0 to 40 points. Total score is the sum of the three subscales (range 0 to 120).
Normative Data Provided: None
Components: Voice Handicap Index
Procedures: Subjects read each of the 30 items and rate their response as "never," "almost never," "sometimes," "almost always," or "always" to indicate how frequently they have the same experience. Responses are scored on an equal-appearing 5-point scale from "never" scored as 0 to "always" scored as 4. Functional, emotional, and physical subscales are calculated as the sum of the responses to the 10 items in each scale. Total Voice Handicap Index is the sum of the scores on the three subscales.
Other: None
Earlier Versions: None

The Voice Handicap Index (VHI) provides clinicians and researchers an instrument for measuring the impact on everyday functioning and health-related quality of life of voice disorders as perceived by the individual with the voice disorder (Table 22).25, 26
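
The scoring procedure summarized in Table 22 can be sketched in a few lines. The example below assumes the caller has already grouped the 30 responses by subscale, because the assignment of individual items to subscales is not given in this report; the function name and the sample responses are hypothetical.

```python
# Response labels and their point values, as described in Table 22.
RESPONSE_POINTS = {
    "never": 0,
    "almost never": 1,
    "sometimes": 2,
    "almost always": 3,
    "always": 4,
}
SUBSCALES = ("functional", "emotional", "physical")


def score_vhi(responses):
    """Sum 10 responses per subscale (0-40 each) and return all four scores.

    `responses` maps each subscale name to a list of 10 response labels.
    Grouping items by subscale is left to the caller (an assumption here).
    """
    scores = {}
    for name in SUBSCALES:
        items = responses[name]
        if len(items) != 10:
            raise ValueError(f"{name} subscale needs exactly 10 responses")
        scores[name] = sum(RESPONSE_POINTS[label] for label in items)
    scores["total"] = sum(scores[name] for name in SUBSCALES)  # range 0-120
    return scores


# Hypothetical respondent, for illustration only.
vhi_scores = score_vhi({
    "functional": ["sometimes"] * 10,
    "emotional": ["never"] * 10,
    "physical": ["always"] * 10,
})
```

For this hypothetical respondent the subscale scores are 20, 0, and 40, giving a total of 60 out of a possible 120.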

Key Question No. 1

Benninger et al. and Jacobson et al. standardized the VHI using a population of patients attending a voice clinic in a major hospital.25, 26 Although they provided detailed demographic and clinical information about their populations, they made no claim as to whether their populations were representative of US voice patients (Evidence Tables 70-72).

Reliability

Jacobson et al. provided data on the development and evaluation of internal consistency and test-retest reliability for the VHI.26 Because the VHI is a self-administered instrument, no examiner is needed and thus inter- and intra-rater reliability are not relevant. This fact, coupled with the relatively recent development of the instrument, is likely responsible for there being no manual available for review (Evidence Table 71).

Internal consistency reliability

Jacobson et al. documented their use of modern psychometric test theory to develop the VHI, providing a detailed description of the item selection and reduction process.26 They reported a Cronbach's coefficient alpha equal to 0.95 and thus met our "strict" criterion. When correlations between the total VHI score and three subscales (i.e., physical, emotional, functional) are considered (r = 0.88 to 0.93), the VHI met our "relaxed" internal consistency reliability criterion.
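
For readers unfamiliar with the statistic, Cronbach's coefficient alpha can be computed from a respondents-by-items score matrix as k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The sketch below uses sample variances and a small hypothetical data set; it is not a reconstruction of Jacobson et al.'s analysis.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for rows of respondents and columns of items."""
    if len(item_scores) < 2 or len(item_scores[0]) < 2:
        raise ValueError("need at least two respondents and two items")
    n_items = len(item_scores[0])
    columns = list(zip(*item_scores))  # one tuple of scores per item
    item_variance_sum = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical responses from five people to three items scored 0-4.
sample = [
    [0, 1, 0],
    [1, 1, 2],
    [2, 3, 2],
    [3, 3, 4],
    [4, 4, 4],
]
alpha = cronbach_alpha(sample)
meets_strict_cutoff = alpha >= 0.90
```

Alpha rises as items vary together: if every respondent gave the same answer to all items, the item variances would account for the smallest possible share of the total-score variance and alpha would equal 1.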

Test-retest or intra-rater reliability

Jacobson et al. reported test-retest reliability for the total scale and the three subscales.26 Only the correlations for total score (r = 0.92) and emotional subscale (r = 0.92) exceeded our threshold for test-retest reliability. The functional (r = 0.84) and physical subscales (r = 0.86) did not.

Validity

Two peer-reviewed articles provided data on the VHI's construct and concurrent validity (Evidence Table 72).25, 26

Construct validity

Jacobson et al. reported moderately strong correlations between the functional, emotional, and physical subscales (r = 0.70 to 0.79), suggesting that these subscales measure parts of the same overall construct.26 High correlations between the total VHI score and the subscales (r = 0.88 to 0.93) suggested that each subscale contributes to the overall measure of self-rated voice disorder severity.

Concurrent validity

Both Benninger et al. and Jacobson et al. compared the VHI with different measures of voice disorder severity.25, 26 Jacobson et al. correlated the VHI total and subscale scores to a self-rating of severity of voice handicap (r = 0.60 for total score comparison). Benninger et al. evaluated the relationship between the VHI (total and subscales) and the Medical Outcomes Study Short Form-36 (SF-36). They observed moderately high correlations between the VHI total and subscale scores and the SF-36 social and emotional functioning and mental health domains. VHI total score correlated significantly with SF-36 physical and functional subscales.

Although not direct evidence of concurrent validity, Benninger et al. provided evidence that the health-related quality of life (HRQOL) of voice patients differs significantly from that of individuals without voice disorders and that of individuals with selected chronic diseases.25 Additionally, they provided evidence that HRQOL varies among selected subgroups of dysphonia patients.

Taken together, these two studies provide evidence of acceptable concurrent validity.

Supplemental Analyses -- Usability Analysis

This section describes the results of an additional analysis to shed some light on how feasible or practical it is to use these instruments in everyday settings. When deciding which instrument to use, a clinician must evaluate whether the manual provides sufficient information on how to administer and score the instrument. Two reviewers rated the usability of instrument manuals on the following eight criteria:

  1. Instrument administration procedures can be duplicated;

  2. Scoring procedures can be duplicated;

  3. Examiner qualifications are specified;

  4. Required examiner training is documented;

  5. Environmental and equipment requirements are described;

  6. Raw score scale meaning and interpretation are described;

  7. Derived score scale meaning and interpretation are described; and

  8. Scale construction is described.

Table 23. Usability Evaluation Results, by Instrument
Instrument | Usability Criterion a (1 to 8) | Total Met b, No. | Total Met b, %
Adult Language
BDAE-2 | + + + + +/− + | 5.5 | 68.8
PICA | + + + + + + + + | 8 | 100
WAB | + + + +/− | 3.5 | 43.8
Child Language
CELF-3 | + +/− + + + +/− + + | 7 | 87.5
CELF-P | + + +/− + + + + + | 7.5 | 93.8
CELF-3Sp | + + + + + + + + | 8 | 100
PLS-3 | + + +/0 + + + + | 6.5 | 81.2
PLS-3Sp | + + +/− +/− + + + + | 7 | 87.5
TOLD-P:3 | + + + + + + + + | 8 | 100
TOLD-I:3 | + + + + + + + + | 8 | 100
TOPL | + + + + + + + + | 8 | 100
Adult Speech
AIDS | + + + + + + + + | 8 | 100
DEB | + + + +/− +/− +/− | 4.5 | 56.2
SSI-3 | +/− +/− + + + | 4 | 50.0
Child Speech
GFTA-2 | + + + + + + + + | 8 | 100
SSI-3 | +/− +/− + + + | 4 | 50.0
Voice c
MDVP | + +/− +/− + + | 4 | 50.0
% agreement (criteria 1 to 8) | 94.1  82.4  81.2  94.1  100  76.5  82.4  88.2
Kappa (criteria 1 to 8) | --- d  ---  0.60  0.87  1  ---  0.34  0.45

a + = Reviewers agreed that manual met criterion; − = Reviewers agreed that manual did not meet criterion; +/− = Reviewers disagreed; 0 = Reviewer did not assess this criterion.

b Calculated as sum, where + = 1, +/− or +/0 = 0.5 points, and − = 0.

c Manual not available for GRBAS scale or VHI.

d SAS could not calculate kappa because of missing cells. When the missing values were replaced with 0.001, kappa values were 0.002, 0.0004, and 0.0002 for criteria 1, 2, and 6, respectively.

For full name of instruments, see Table 4.

We described these methods in Chapter 2; supplemental information appears in Appendix D. Table 23 summarizes the population and disorder results.
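
The "Total Met" columns of Table 23 follow directly from the point values given in its footnote b. The sketch below shows that calculation for one manual; the function name and the example rating pattern are illustrative assumptions, and the code accepts either an ASCII hyphen or the typographic minus sign in the rating symbols.

```python
# Points per rating symbol, following footnote b of Table 23.
POINTS = {"+": 1.0, "+/-": 0.5, "+/0": 0.5, "-": 0.0}


def usability_total(ratings):
    """Return (points, percent of 8 criteria) for one manual's eight ratings."""
    if len(ratings) != 8:
        raise ValueError("expected one rating for each of the eight criteria")
    # Normalize the typographic minus (U+2212) to an ASCII hyphen before lookup.
    points = sum(POINTS[r.replace("\u2212", "-")] for r in ratings)
    return points, round(100 * points / 8, 1)


# Illustrative pattern: five criteria met, two not met, one disagreement.
example_ratings = ["+", "+", "-", "-", "+", "+", "+/-", "+"]
points, percent = usability_total(example_ratings)
```

This pattern yields 5.5 points out of 8, or roughly 69 percent; a manual rated "+" on all eight criteria scores 8 points and 100 percent.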

Adult Language Disorder Instruments

User manuals varied widely in how well they met our usability criteria. The PICA manual was particularly user-friendly, meeting all eight criteria. Both reviewers agreed that the WAB manual met the fewest criteria of all instruments examined, not just those for adult language. This result is not surprising given the brevity of this manual (seven pages inclusive of references). The BDAE-2 manual ratings were intermediate; both reviewers agreed that information on examiner requirements and training (Criteria 3 and 4) was insufficient for the expected user. The two reviewers disagreed as to whether the derived aphasia classifications and their interpretations were well explained.

Child Language Disorder Instruments

Our reviewers rated the child language instrument manuals as quite user-friendly. In general, these manuals met the majority of the criteria (totals ranging from 6.5 to 8). Our raters scored the two Preschool Language Scale-3 manuals in similar ways, disagreeing on whether they described examiner qualifications and training (Criteria 3 and 4) in sufficient detail. In a similar vein, the raters disagreed on the examiner qualification criterion for the CELF-P.

Adult Speech Disorder Instruments

The AIDS manual met all eight criteria as evaluated by both reviewers. The SSI-3 met three of the criteria completely (both reviewers agreed) and two partially (i.e., one reviewer rated that the criterion had been met, the other did not). The reviewer who indicated that the SSI-3 manual did not meet Criterion 1 (description of procedures allows duplication of administration) and Criterion 2 (description allows duplication of scoring procedures) noted that procedures in the manual allowed the examiner to interrupt the stutterer to simulate normal conversation; her concern was that this instruction might be interpreted differently by each examiner and during each administration. Thus, the scores obtained might vary. Both reviewers indicated that the SSI-3 manual lacked information on examiner qualifications and training and on environmental and equipment requirements (Criteria 3, 4, and 5) for instrument administration. The DEB fared only slightly better than the SSI-3. Our reviewers disagreed in their assessments for half of the criteria. Both agreed that the administration procedures and scoring instructions could be easily followed and that examiner training had been adequately described. However, one reviewer was very concerned about the fact that many of the measures are subjective and might lead "different examiners to assign ratings differently based purely on their subjective impressions."

Child Speech Disorder Instruments

The GFTA-2 manual met all eight criteria as evaluated by both reviewers. The SSI-3 met three criteria completely (both reviewers agreed) and two partially (i.e., one reviewer rated that the criterion had been met, the other did not). As above, the rater who indicated that the SSI-3 manual did not meet Criteria 1 and 2 was concerned about examiner interruptions. Both raters indicated that the SSI-3 manual lacked information on examiner qualifications and training (Criteria 3 and 4) and on environmental and equipment requirements for instrument administration (Criterion 5).

Voice Disorder Instruments

Our usability reviewers scored only the MDVP instrument manual for the eight criteria. We were unable to rate the usability for the GRBAS Scale because we could not obtain an English-language manual. No manual was available for the VHI; all data on development and evaluation of reliability and validity were reported in the peer-reviewed literature.

The MDVP manual met three criteria (1, 4, and 5) as assessed by both reviewers. Our reviewers disagreed about Criteria 2 and 3. The reviewer who indicated that Criterion 2 was not met noted that although the manual does not report standardized scores, many different authorities claim various standard scores for the parameters tested by the MDVP system. The other reviewer made no comments to shed light on her rating of Criterion 3.

Chapter 4. Conclusions

This chapter begins with a summary of the evidence supporting each of the key questions reviewed in depth in Chapter 3, arrayed by the combination of age group and type of disorder: namely, adult language disorders, child language disorders, adult speech disorders, child speech disorders, and voice disorders. For Key Question No. 1 (i.e., which instruments demonstrate the salient characteristics of a good diagnostic tool), we evaluate the evidence for reliability and validity based on the criteria described in Chapter 2 on methods. We also discuss the normative data available and whether the instruments have been validated in five subpopulations of special interest to the Social Security Administration (SSA). This population-specific focus concerns (1) persons who are English-speaking and have normal hearing, both with and without normal cognition; (2) persons who are not English-speaking and have normal hearing, both with and without normal cognition; (3) persons who are mentally retarded; (4) persons with learning disorders; and (5) persons who are hard of hearing. We then assess the evidence for predictive validity for future communicative impairment and performance (Key Question No. 2) and address similar population-specific issues.

We close the chapter by describing the limitations of this evidence report; we address both the constraints specific to the process of instrument selection and literature search and retrieval and the drawbacks of the literature and evidence base associated with the 18 selected instruments. We also reflect on deficiencies of this overall body of evidence for addressing SSA's main policy and programmatic concerns. In discussing the limitations of the available evidence, we first address those that are common across the instruments and then focus on issues that are peculiar to the instruments or to a given population or category of disorder. Chapter 5 builds on these discussion points to propose a research agenda that we judge important to bring to the attention of federal agencies, foundations, or other groups concerned with clinical and health services research in this area.

Key Question No. 1: Instrument Properties

Key Question No. 1 asked, essentially, "Do the reviewed instrument(s) have demonstrated reliability, validity, and normative data?" To address this issue, we had first to establish criteria for evaluating those properties of the instruments. We then could assess instruments against a priori, scientific benchmarks in which SSA (and the Agency for Healthcare Research and Quality [AHRQ]) could have confidence, should questions arise as to how we arrived at our ultimate conclusions.

Evaluation Criteria and Summary Results

Psychometric Evaluation Criteria

As described in Chapter 2, our criterion for reliability was "strictly" met if the following conditions held: (1) internal consistency reliability, measured using either Cronbach's coefficient alpha or Kuder-Richardson statistics (K-R 20), is greater than or equal to 0.90; and (2) test-retest/intra-rater reliability is greater than or equal to 0.90 if measured using a correlation coefficient, or greater than or equal to 0.80 if measured using Cohen's kappa; and (3) inter-rater reliability is greater than or equal to 0.90 if measured using a correlation coefficient, or greater than or equal to 0.80 if measured using Cohen's kappa. Some might reasonably argue that the criterion for internal consistency reliability is set too high given the complexity of speech and language functioning and disorders. Additionally, the resultant variability in daily performance suggests that our criterion for test-retest reliability or intra-rater reliability also may be too high. Thus, we defined a "relaxed" criterion, which differs from the strict criterion in that internal consistency reliability may be as low as 0.80 and/or test-retest/intra-rater reliability may be as low as 0.80 (correlations) or 0.70 (Cohen's kappa). The relaxed criterion is at a level suitable for having confidence in group, rather than individual, comparisons.
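
The strict and relaxed criteria can be expressed as a small decision rule. The sketch below is one possible reading, under stated assumptions: all cutoffs are treated as inclusive, the inter-rater cutoff is the same under both criteria, and an inter-rater value of None means that inter-rater reliability does not apply (as for a self-administered or automated instrument). The function and parameter names are hypothetical.

```python
def _meets(value, is_kappa, correlation_cutoff, kappa_cutoff):
    """Compare a coefficient with the cutoff that matches its statistic."""
    return value >= (kappa_cutoff if is_kappa else correlation_cutoff)


def reliability_grade(internal, retest, inter_rater=None, *,
                      retest_is_kappa=False, inter_is_kappa=False):
    """Return 'strict', 'relaxed', or 'not met' for one instrument.

    Assumption: inter_rater=None means inter-rater reliability is not
    applicable, so that condition is skipped rather than failed.
    """
    inter_ok = inter_rater is None or _meets(inter_rater, inter_is_kappa, 0.90, 0.80)
    if not inter_ok:
        return "not met"
    if internal >= 0.90 and _meets(retest, retest_is_kappa, 0.90, 0.80):
        return "strict"
    if internal >= 0.80 and _meets(retest, retest_is_kappa, 0.80, 0.70):
        return "relaxed"
    return "not met"


# Hypothetical coefficient sets, for illustration only.
grade_a = reliability_grade(0.95, 0.92)                             # all strict cutoffs met
grade_b = reliability_grade(0.95, 0.84)                             # retest only meets relaxed cutoff
grade_c = reliability_grade(0.95, 0.92, 0.41, inter_is_kappa=True)  # inter-rater kappa too low
```

Ordering the checks from strictest to most lenient means an instrument is always assigned the highest grade it qualifies for.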

We emphasize here that the criterion of reliability (whether strict or relaxed) can be said to have greater weight than validity criteria, in that no instrument can be said to be valid if it is not demonstrably reliable. Said another way, users of an instrument cannot know for sure that the instrument is measuring what it purports to measure (i.e., is valid) if, upon multiple administrations, it produces responses or data that are unreliable according to accepted measurement standards.

We set the criterion for validity primarily on the basis of construct validity; however, we do report evidence of concurrent validity. In doing so, we required that interrelationships among subtests, composites, and total scores, measured with correlation coefficients, be statistically significant (i.e., p < 0.05), with a magnitude of at least 0.30. Information derived from principal components or factor analysis supporting the construction of composite or total scores can augment this information.
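
Applied to a single correlation, the validity criterion above reduces to two checks. This minimal sketch assumes that "magnitude" refers to the absolute value of the coefficient and that the p-value is supplied by the source study; the function name and the example values are hypothetical.

```python
def meets_validity_criterion(r, p_value, min_magnitude=0.30, alpha=0.05):
    """True when a correlation is both statistically significant and large enough."""
    return p_value < alpha and abs(r) >= min_magnitude


# Hypothetical correlations between subtests, for illustration only.
checks = {
    "strong and significant": meets_validity_criterion(0.74, 0.001),
    "significant but weak": meets_validity_criterion(0.23, 0.04),
    "negative but large": meets_validity_criterion(-0.45, 0.01),
    "large but not significant": meets_validity_criterion(0.50, 0.20),
}
```

Requiring both conditions guards against two different problems: a trivially small correlation that reaches significance in a large sample, and a large correlation from a sample too small to rule out chance.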

The criterion concerning availability of normative data required that (1) data be available for the population targeted by the instrument, (2) the sample size be adequate (i.e., at least 100 subjects per group), and (3) evidence be provided on how well the sample represents the population(s) of interest. We describe applicability of the evidence to the SSA's targeted populations, but we did not formally incorporate the issue of applicability into our criteria for judging normative data. We note that, for SSA's purposes, issues of normative data probably come into play only for instruments exhibiting some evidence of reliability and validity.

Evidence Grading Criteria

As described in Chapter 2, we separated instrument manuals from peer-reviewed literature before assigning a grade to the strength of available evidence supporting each key question. We did so because traditional evidence-based practice would dictate that we downgrade the manuals for not having been published in the peer-reviewed literature. However, because a number of the instrument manuals employed rigorous psychometric methods in the instrument development and validation process, we judged it important to reflect those efforts in our grading scheme. Reflecting the quality of the individual instruments was also important, and doing so would not be possible if we had graded using a system that assigned an "unacceptable" grade to all the manuals for reason of their not having been peer-reviewed.

For both instrument manuals and peer-reviewed literature, a body of evidence in support of the key question would be considered acceptable if the manual or the majority of the available articles reported on well-conducted analyses; had reasonably sized, representative samples; and met our psychometric evaluation criteria. Unacceptable evidence would be that for which the studies had been poorly conducted, used small or nonrepresentative samples, or had results that did not meet or only partially met the psychometric criteria discussed earlier.

Tabular Summaries

Table 24 summarizes the available reliability and validity data and normative information for the 18 instruments that we reviewed and indicates whether the psychometric evaluation criteria are met. We document whether each instrument (listed in alphabetical order within disorder) met either the strict or the relaxed criterion for reliability; the next two sets of columns indicate with an X whether the instrument met construct and concurrent validity criteria and whether normative data are available and demonstrably representative of the US population. Finally, the right-most columns show (with an X) whether the instrument can comfortably be said to be applicable to several special populations. (In all cases, a blank means the instrument does not meet the relevant evaluation variable.)

Table 25. Strength of Available Evidence for Instruments, by Key Question [a]

Instrument | Key Question No. 1 | Key Question No. 2

Adult Language
BDAE-2 | Unacceptable |
PICA | Unacceptable | Unacceptable
WAB | Unacceptable (original only) |

Child Language
CELF-3 | Unacceptable | Acceptable (earlier version)
CELF-P | Unacceptable |
CELF-3Sp | Acceptable |
PLS-3 | Acceptable (except for children 0-8 months old) |
PLS-3Sp | Unacceptable |
TOLD-P:3 | Acceptable | Acceptable (earlier version)
TOLD-I:3 | Acceptable |
TOPL | Unacceptable |

Adult Speech
AIDS | Unacceptable |
DEB | Unacceptable |
SSI-3 | Unacceptable |

Child Speech
GFTA-2 | Unacceptable | Acceptable (earlier version)
SSI-3 | Unacceptable |

Voice
GRBAS | Unacceptable |
MDVP | Acceptable |
VHI | Acceptable |

[a] Blank cells indicate that we found no data on predictive validity and thus could not evaluate Key Question No. 2.

Table 25 summarizes for each instrument the strength of available evidence for the two key questions. We document whether the available evidence is acceptable or unacceptable. In all cases, a blank means that no evidence exists to address the key question. The following paragraphs discuss how the instruments measure up against our psychometric evaluation and evidence strength criteria.

Adult Language Disorder Instruments
Reliability

Of the three adult language disorder instruments we evaluated (see the top three rows of Table 24), the Porch Index of Communicative Ability (PICA) met the relaxed reliability criterion. The original version of the Western Aphasia Battery (WAB) met the strict criterion, but no information was provided for the 2nd edition. The Boston Diagnostic Aphasia Examination, 2nd Edition (BDAE-2), met neither criterion: its internal consistency data fell short of both thresholds, and test-retest/intra-rater and inter-rater reliability were not measured.

Validity

All three adult language disorder instruments met the construct validity criterion. However, Crary et al. reported a small study that suggested that neither the WAB nor the BDAE-2 consistently classified individuals with aphasia.44 Only for the WAB was evidence of concurrent validity available.

Availability of Normative Data

Normative data of various types are available for only the BDAE-2 and the WAB. However, the instrument developers derived these data only from individuals with aphasia at single institutions and provided no information as to whether these individuals are representative of the population of typical individuals with aphasia.

Summary

Only the PICA and the original WAB met our a priori standards of evidence for reliability and validity; neither one met our standards for the availability of representative normative data. Overall, for strength of evidence, we assigned a grade of "unacceptable" for all three adult language disorder instruments (see Table 25).

Child Language Disorder Instruments
Reliability

Of the eight child language instruments we evaluated (see Table 24), three instruments -- the Clinical Evaluation of Language Fundamentals, 3rd Edition, Spanish Edition (CELF-3Sp), the Test of Language Development-Primary, 3rd Edition (TOLD-P:3), and the Test of Language Development-Intermediate, 3rd Edition (TOLD-I:3) -- met the relaxed reliability criterion for total and composite scores. The Preschool Language Scale, 3rd Edition (PLS-3) Total Score met the relaxed reliability criterion for all age groups except children between 0 and 8 months of age. The CELF-3 (the basic third edition) met the relaxed criterion for total score but not for composite scores. The PLS-3Sp Total Score met the internal consistency reliability criterion only for children above 18 months of age; measures of inter-rater reliability and test-retest reliability were not reported. The developers of the Test of Pragmatic Language (TOPL) did not measure internal consistency reliability.

Validity

Six instruments wholly met the construct validity criterion: CELF-P, CELF-3Sp, PLS-3, TOLD-P:3, TOLD-I:3, and TOPL. CELF-3 met this criterion for composite scores only. All instruments except the PLS-3Sp had evidence of concurrent validity.

Availability of Normative Data

We found normative data for all instruments except the PLS-3Sp; its developers derived normative data from the English-language version (PLS-3). Of the instruments that provide normative data, all derived the data from nationally representative samples; in the case of the CELF-3Sp, the norms are representative of the US Hispanic population.

Only the developers of the TOLD-P:3 and TOLD-I:3 provided evidence of reliability and validity for use with children who have learning disabilities, speech-language disorders or delay, or mental retardation, or who are hard of hearing. The developers of the other instruments specifically excluded children with these disabilities from their normative samples.

Summary

Only the CELF-3Sp, TOLD-P:3, and the TOLD-I:3 met the psychometric evaluation standards we established for reliability, validity, and the availability of representative normative data. The PLS-3 met standards for all age groups except children 8 months and younger. Only for these four instruments did we judge the strength of evidence to be "acceptable" for addressing this key question (see Table 25).

Adult Speech Disorder Instruments
Reliability

Of the three adult speech disorder instruments we evaluated (see Table 24), none met either the strict or the relaxed criterion for reliability. The Assessment of Intelligibility in Dysarthric Speech (AIDS) met the individual criteria for test-retest or intra-rater reliability and inter-rater reliability (one of two rating approaches), but its developers evidently did not measure internal consistency reliability. We found data about test-retest/intra-rater and inter-rater reliability for the Stuttering Severity Instrument for Children and Adults, 3rd Edition (SSI-3); however, the SSI-3 developers applied inappropriate statistics to measure reliability and did not measure internal consistency reliability. We note, however, that for the AIDS and the SSI-3, how one might measure internal consistency reliability is not clear because of the underlying nature of the diagnostic tool. In the case of the Dysarthria Examination Battery (DEB), test-retest/intra-rater and inter-rater reliability did not meet the individual criteria, and internal consistency reliability was not measured.

Validity

Only the SSI-3 met the construct validity criterion; the DEB and AIDS did not. We also uncovered evidence of concurrent validity for the SSI-3 but not for the other two instruments.

Availability of Normative Data

Normative data of various types are available for the AIDS and the SSI-3. However, the instrument developers provided no information that would permit evaluators to know whether the standardization samples are representative of the population of adults with disordered speech. No information is available pertaining to the subpopulations of interest to the SSA.

Summary

Generally, none of the adult speech disorder instruments met the standards of evidence we established for reliability or validity; the exception is the SSI-3 for construct and concurrent validity. No instrument met normative data standards. We assigned a strength of evidence grade of "unacceptable" to these instruments (see Table 25).

Child Speech Disorder Instruments
Reliability

Neither the SSI-3 nor the Goldman-Fristoe Test of Articulation, 2nd Edition (GFTA-2) met reliability thresholds (see Table 24). The GFTA-2 met the strict criterion of internal consistency reliability for all age groups of girls but not for all age groups of boys; all age groups of boys met the relaxed criterion. The SSI-3's developers did not measure internal consistency reliability. Both research teams used inappropriate statistical methods to measure test-retest/intra-rater and inter-rater reliability.

Validity

Although only the SSI-3 met the construct validity criterion, we found some evidence of concurrent validity for the original GFTA (not GFTA-2) and the SSI-3.

Availability of Normative Data

Nationally representative data (by age and sex) are available for the GFTA-2. Although the GFTA-2 investigators included children with various learning disabilities, hearing impairments, and speech and language disorders in their standardization sample, they did not report individual normative data for these groups. Furthermore (and more important for SSA purposes), they indicated that the normative data cannot be used with cognitively impaired or mentally retarded children and provided no guidance on the use of GFTA-2 with non-English-speaking children. Although the SSI-3 has normative data for preschool and school-age children, its developers provided no evidence to assess whether the standardization samples represent the population of stuttering children in this country and no information pertaining to the populations of interest to the SSA.

Summary

Neither instrument met the psychometric evaluation standards we established for reliability, validity, or the availability of normative data. No information emerged about use of these instruments with the targeted subpopulations. The strength of evidence for both instruments was judged "unacceptable" (see Table 25).

Voice Disorder Instruments

The three instruments reviewed for voice disorders -- GRBAS (Grade, Rough, Breathy, Asthenic, Strain) Scale, Multi-dimensional Voice Program, Model 5105 (MDVP), and Voice Handicap Index (VHI) -- were extremely diverse (Table 24). For at least one of these instruments, the MDVP, traditional approaches to measuring reliability and validity likely are not appropriate. Thus, for the MDVP, we indicate which criteria may not be appropriate and have judged the instrument by only the applicable or appropriate criteria.

Reliability

Only the VHI met the strict criterion for reliability. The MDVP met the criteria for test-retest/intra-rater and inter-rater reliability but not the criterion for internal consistency reliability. However, assessing internal consistency reliability may not be appropriate for the MDVP: it measures and reports 33 acoustical parameters that would not be compared with one another, because different voice disorders would affect the score of each parameter independently.

Validity

Only the VHI met the construct validity criterion; this criterion was not appropriate for the MDVP given the nature of the instrument described earlier for reliability. The VHI's developers reported detailed evidence of concurrent validity comparing results from the instrument to those from the Short-Form (SF)-36, which is considered a standard instrument for measuring health-related quality of life. We also identified evidence of concurrent validity for the MDVP, but that evidence is problematic: the developers compared the MDVP against the GRBAS Scale, an instrument for which we found little to no evidence of either reliability or validity.

Availability of Normative Data

Only the developers of the MDVP reported normative data, although they indicated that the data are representative only of individuals like those in their standardization sample. They further suggested that MDVP users should develop their own normative data based on their specific patient populations. The VHI developers provided a form of normative data by comparing SF-36 scores of voice-disordered individuals with those of normal individuals and individuals with various forms of chronic diseases.

Summary

Both the MDVP and the VHI met our psychometric evaluation standards for reliability, validity, and normative data. For neither instrument, however, is any information available concerning SSA's targeted subpopulations. For these instruments we judged the strength of evidence to be "acceptable" for addressing this key question (see Table 25).

Key Question No. 2: Predictive Validity

For SSA's purposes, information on the characteristics of diagnostic tools (as presented above for the 18 instruments we reviewed) is a necessary but not sufficient body of information. Part of the requirements concerning determinations for disability eligibility involve making some assessment of future performance or lasting and significant impairment. To assist in that type of forecasting, evaluation or diagnostic instruments should also be able to provide some predictive information, and that is the realm of predictive validity.

Of the 18 instruments we reviewed, information on predictive validity was available, but of variable quality, for only four -- one for adult language disorder, two for child language disorders (but not for versions directly reviewed in this report), and one for child speech disorders. These are discussed in more detail below. No instrument we reviewed for either adult speech disorders or voice disorders had evidence of predictive validity.

Adult Language Disorder Instruments

The two PICA studies provide limited but contradictory evidence of its ability to predict future impairment at 6 months. According to Lendrem and Lincoln,48 the PICA can do so, but according to a later study by Lincoln and McGuirk,49 it cannot. In both studies, the samples were small and derived from investigations conducted for another purpose; the investigators presented no information on whether the results could be generalized to typical adult aphasics. Further, the studies employed different methods (which were generally poorly documented). Given this picture, we cannot conclude that evidence supports predictive validity for this category of instruments.

Child Language Disorder Instruments

We identified limited predictive validity evidence related to the revised version of the CELF (CELF-R) and the second edition of the TOLD-P (TOLD-P:2); none was found for the versions of these instruments selected for this evidence review. Kotsopoulos et al. reported data on the ability of the CELF-R instrument to predict reading, spelling, and math performance in children with severe behavioral and psychiatric disorders who were attending a day treatment program.59 Although the initial CELF-R performance significantly predicted the children's gains in grade-level scores for reading and math (but not spelling) across the academic year, we caution, for several reasons, that users should not conclude that CELF-R scores predict future impairment. In particular, the relatively small sample comprised children with psychiatric and behavioral disorders, so one cannot evaluate the generalizability of the results to populations beyond those having characteristics similar to the sample.

In 2000, Lewis et al. evaluated the predictive validity of the TOLD-P:2 for school-age language, reading, and spelling skills among 87 children who as preschoolers had been identified with moderate to severe speech sound disorders.72 School-age (elementary school) follow-up data were available for 52 of the original 87 children. Preschool TOLD-P:2 scores were a significant predictor of school-age language and reading skills but not spelling skills. This research group also examined the discriminative power of one subtest and several composites to predict school-age impairments in language, reading, and spelling. For the TOLD-P:2 Semantic composite score, sensitivity was 0.69 and specificity was 0.76 for discriminating between children with and without reading disorders at school age, but this version of the instrument did not discriminate between children with and without language or spelling disorders at school age. The TOLD-P:2 Syntax composite score was able to discriminate between children with and without all three types of disorders at school age. Although Lewis and colleagues conducted their analyses carefully, information remains insufficient to evaluate the generalizability of their results. Thus, we urge caution in concluding that this study supports predictive validity.
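
For readers less familiar with the two discrimination measures cited above, the following minimal sketch shows how sensitivity and specificity are computed from a two-by-two classification table. The counts are invented for illustration; they are not the counts from Lewis et al. and do not reproduce the reported values of 0.69 and 0.76.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Compute sensitivity and specificity from classification counts.

    tp: children with the later disorder whom the preschool score flagged
    fn: children with the later disorder whom the preschool score missed
    tn: children without the disorder whom the score correctly cleared
    fp: children without the disorder whom the score wrongly flagged
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one child with and one without the disorder")
    sensitivity = tp / (tp + fn)  # proportion of true cases detected
    specificity = tn / (tn + fp)  # proportion of non-cases correctly cleared
    return sensitivity, specificity


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
sens, spec = sensitivity_specificity(tp=18, fn=6, tn=20, fp=5)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With samples as small as the one described here, each misclassified child shifts either proportion by several percentage points, which is one reason the point estimates warrant cautious interpretation.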

Child Speech Disorder Instruments

Although we found no literature describing the predictive validity of the GFTA-2 (or the SSI-3), Lewis et al. also provided some evidence for the original version of the GFTA.72 Preschool scores on the original GFTA significantly predicted school-age reading but not language or spelling skills. Although the sample was relatively small, the research group carefully documented that differential attrition did not occur, but they did not provide sufficient information with which to evaluate the generalizability of their results. Thus, we would urge caution in concluding that this study supports the predictive validity of the GFTA (or GFTA-2).

Summary

Limitations in the studies, including sample size and generalizability, as well as insufficient documentation of methods, weaken our ability to conclude that any of the instruments examined in this evidence report have adequate evidence of predictive validity for either communicative impairment or functioning. Of the four instruments for which we identified evidence (of any sort) of predictive validity, only the TOLD-P:3 had demonstrated "acceptable" evidence of reliability and validity (Key Question No. 1). Unfortunately, evidence to support predictive validity derived from the previous version of the instrument, the TOLD-P:2. Thus, complete and acceptable evidence on both key questions is not available for any instrument that we reviewed. Only for the TOLD-P:3 does the mixed evidence begin to reach this level of acceptability.

Limitations of the Evidence Report

As just discussed, we encountered several challenges in conducting this systematic evidence review. Some are directly related to development of key questions, others to selection of reviewed instruments (given the key questions), and yet others to what literature turned out to be available or accessible. Additionally, most of the instruments themselves have important limitations either generally or for the subpopulations and disorder groups specified. Although some instruments hold promise for the SSA's purposes, we cannot escape the conclusion that this is a comparatively thin evidence base for addressing the important policy and clinical questions of concern to SSA (and AHRQ).

Key Questions and Instruments

Our results and conclusions in this evidence report are, of course, heavily contingent on the structure and focus of the key questions. The original questions (posed as part of the government's request for proposal) were those judged within the SSA to be of high priority to them in assisting their later development of criteria for determining disability in individuals with speech and/or language disorders. We tailored and revised the questions to some extent in early discussions with SSA and AHRQ, to make them compatible with the nature of a systematic evidence review (one without formal recommendations) and manageable within the time and budget constraints of this work. In the end we could devise no completely satisfactory resolution to the SSA's need for a broad canvass of the field; the agency needed a review that was not restricted by patient group or disorder type, and we needed to avoid an unmanageable number of citations that would have been generated by reviewing all relevant assessment instruments.

Given the priority of including both adults and children and three major disorders (speech, language, and voice) while keeping the review within reasonable limits, we engaged in a systematic instrument selection process. Our aim was to optimize scientific and clinical appreciation of issues in diagnosing and caring for patients with these conditions and the likelihood of finding information in an evidence-based approach, while taking account of SSA's time frame and our overall budget. Within these constraints, a national panel of experts selected the 18 instruments through a formal, iterative process at a meeting early in the project. They based their selections on (a) judgments as to the extent to which instruments are already known and used in this field and assess a range of ages and types of speech and language performance and (b) knowledge or beliefs that peer-reviewed information might well exist for the instruments in question.

Thus, these instruments do not cover the universe of instruments in this field or for a population from 0 to 62 years of age; neither do they address all relevant aspects of speech or language disorders in adults and children. Furthermore, they were selected by one group of experts at a particular time. Although we have no special reason to question the selections, we are cognizant of the fact that other expert panels (or this one at some other point in time) might identify a different set of instruments to accord high priority for review.

We emphasize these points because we wish to caution readers not to assume that these particular instruments were necessarily regarded, a priori, as the "best" or "most comprehensive" for evaluating speech and language disorders in children or adults. Nonetheless, we are confident that they do represent an appropriate, reasonable, timely, broad-based selection of instruments by which SSA could gain a rigorous view of the state of the science in detecting and predicting the likely outcomes from speech and language disorders.

Generic Deficiencies in this Body of Literature

As is clear from Chapter 3 and the summary above, the peer-reviewed literature rarely yielded data on the reliability and validity for the majority of selected instruments. In comparison with many other clinical fields for which the published evidence base is extensive (and of high quality), even for screening and diagnostic tests, the peer-reviewed knowledge base about ways to identify speech and language disorders and their potential for causing disability is very small. The peer-reviewed literature is the hallmark of systematic evidence reviews, so this lack of information posed a considerable limitation for our work.

Consequently, we elected to expand our efforts to review and abstract data from instrument manuals. Instrument manuals provided reliability and validity data in varying degrees of comprehensiveness. Some very thorough manuals provided reliability and validity information for children by sex, age group, race or ethnicity, presence of speech or language disorder (or both), and similar factors. In some cases, manuals presented data on all aspects of reliability, providing information on internal consistency, test-retest or intra-rater reliability, and inter-rater reliability, and on construct and concurrent or criterion validity. Others provided information on only some aspects of either reliability or validity.

Given this heterogeneity even in non-peer-reviewed publications and to supplement the variable data in instrument manuals, we reviewed, if available, reliability and validity studies in the peer-reviewed literature for the version immediately before the selected version. For example, we examined peer-reviewed literature for the revised version of the CELF,62 the original version of the GFTA,94 and the Stuttering Severity Instrument for Children and Adults, Revised.89 This tactic has its own limitations. First, in some cases the publications are very old (20 to 30 years old). Second, content may well have changed in important ways across the various versions of the instruments. Thus, although such early data may be indicative of reliability and validity of later versions, we cannot at this stage confirm that these data truly and adequately apply to today's versions of the instruments.

Finally, we note study design, conduct, and documentation difficulties in this literature. In a substantial number of reviewed studies, authors provided little information on the sample studied (e.g., demographic characteristics, information on type and severity of speech, language, or voice impairment) or on how they had recruited subjects from the larger population. Disorder-specific samples were often small, and sometimes investigators combined individuals with different types of disorders for analysis purposes without prior analyses to demonstrate that the groups did not differ from each other in any material way. Often, the research teams poorly documented their statistical methods; occasionally, they did not specify statistical approaches at all. For the most part, they did not correct p-values when making multiple statistical comparisons. Statistics not in common use today were sometimes employed, but the publications lacked reference to statistical texts or original articles; hence, we could not determine whether the choice of statistics was appropriate for the past state of knowledge and computing resources.
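
To illustrate the point about uncorrected p-values, the sketch below applies the Bonferroni adjustment, one common and conservative correction for multiple comparisons. It is offered only as an example of the kind of correction that was generally missing; the reviewed studies did not use it, and the p-values shown are invented.

```python
from typing import List


def bonferroni_adjust(p_values: List[float]) -> List[float]:
    """Return Bonferroni-adjusted p-values: each p multiplied by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values from three comparisons within one study.
raw = [0.01, 0.04, 0.03]
adjusted = bonferroni_adjust(raw)

alpha = 0.05
significant_raw = [p < alpha for p in raw]            # all three look significant
significant_adjusted = [p < alpha for p in adjusted]  # only the first survives correction
print(significant_raw, significant_adjusted)
```

In this invented case, three apparently significant findings shrink to one after adjustment, which shows why conclusions drawn from uncorrected comparisons can overstate the evidence.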

Limitations Specific to the Reviewed Instruments

Adult and Child Language Disorder Instruments

In general, the literature reviewed for adult language disorders suffered from four problems. First, all the instruments had been standardized using patients from a single institution, thus seriously limiting the generalizability of the results. In no study did the authors provide information with which to judge how the results would compare to those for typical adult patients with language disorders.

Second, all the instruments were tested with individuals with aphasia or language disorders, with the assumption that individuals without disordered language would achieve the maximum score; in other words, the assumption is that a ceiling effect would be observed. Such an assumption may not be correct, however, and only one study attempted to evaluate the effect.43 An associated limitation is that, when standardizing instruments, most research teams doing the aphasia studies combined stroke patients and patients with traumatic brain injury. Although this practice is typical of studies evaluating neurological problems during the era in which these instruments were developed, subsequent research suggests that these individuals perform differently on language instruments. To the extent this is true, combined standardization data may not be robust (or useful). Finally, the lack of demonstrated relationship to functional performance is also a significant problem for these instruments.

Several limitations are specific to the child-oriented instruments. Of the eight tests we reviewed, only the TOPL addressed "pragmatics," a critical component in a child's ability to perform well in school. We have little information linking performance on our assessment instruments to school performance or to communication abilities in a nontesting situation. That is, all these instruments evaluated children's communicative abilities in essentially artificial settings and rarely examined their relationship to functional performance outside the testing setting.

Finally, reports on only two instruments (TOLD-P:3 and TOLD-I:3) provide reliability data for children with language disorders or delay, learning and hearing impairments, and mental retardation in a representative way. All other instruments specifically exclude these children. In our view, clinicians would regard including such children as essential, given the importance of early language development to the risk of developing learning disabilities12 and the high rate of comorbid psychiatric and communication disorders.13-15 Thus, an important advance for this field will be to test and systematically validate the other instruments for children with these conditions.

Adult and Child Speech Disorder Instruments

In the case of the three instruments to evaluate adult speech disorders, we found no peer-reviewed articles meeting our a priori inclusion criteria, even when we expanded the search to include literature for a previous version (e.g., Stuttering Severity Instrument for Children and Adults, Revised89). For the DEB, reliability data in the manual had been derived from unpublished studies and conference presentations with very small samples, and so we had excluded these materials during the abstract review phase. As already shown in this review, reliability and validity were studied with poor methods or were poorly documented (or both).

One particular characteristic of the psychometric testing of certain child speech instruments should be highlighted. Namely, for the GFTA-2, apparently no reliability or validity testing has been done on two of its three subtests (Sounds-in-Sentences and Stimulability). However, these two subtests provide important information to clinicians in evaluating articulation disorders, and standardized data would be useful. We are uncertain to what extent this inequality in how subtests are examined extends to other unreviewed instruments intended for child populations.

In short, as was true for language disorders instruments, few data are available to determine whether the reviewed speech disorder instruments can provide information about the subpopulations of patients to whom SSA accords high priority.

Voice Disorder Instruments

The diversity of the voice disorder instruments was problematic in certain respects. First, for all reviewed instruments, data had not been derived from representative samples. Most investigators used congregate patient samples from voice clinics, without using or providing information on selection criteria; in few cases were individuals with "normal" voices evaluated. Many studies provided no demographic information (e.g., age, sex, ethnicity/race). In general, authors did not report whether their study subjects were representative of the population of voice-disordered patients or their nosological subgroups.

Of particular concern (certainly for SSA's purposes) is the lack of data for children or for members of different racial and ethnic groups. Additionally, investigators apparently made little distinction between men's and women's voices. No developer team reported data on age or race subgroups, characteristics for which empirical and clinical studies have shown voice quality to vary.117-119 Thus, the available literature provides little guidance for assessing voice disorders in these populations; as regards subpopulations of particular interest to the SSA, therefore, no data are available.

Additionally, we could not obtain a user or instruction manual for two of the voice evaluation instruments. For the GRBAS Scale, we were unable to locate an English-language manual; we found only reports of the scale's use and what most experts consider its introduction to the English-language literature.98 Thus, we could not review the original reliability and validity evaluation (assuming one had been conducted). The VHI, a self-administered scale, has no manual at this time; instead, Jacobson et al. describe the instrument, provide administration instructions, and report reliability and validity evaluation.26

Using Our Results to Develop Criteria for Disability

Although our results suggest that few of the reviewed instruments have desirable levels of reliability and validity, we do not mean to suggest that these instruments should not be used as part of a comprehensive evaluation of speech and language disorders in adults and children. Several are robust enough to warrant some confidence in the results they yield with respect to diagnosis for individual patients. Among these, as cases in point, are the CELF-3Sp, the PLS-3, the TOLD-P:3, the MDVP, and the VHI, and perhaps the PICA and CELF-3. Our point is that clinicians and others wishing to administer and apply them must give careful attention to the limitations that we have discussed in this report and factor that information into the ultimate conclusions they reach concerning possible diagnoses of impairment and disability for individual patients.

We emphasize, however, that no viable body of evidence exists to support the use of any of the individual instruments reviewed to predict future performance of the person assessed. For the broader evaluation needs of SSA for determining disability eligibility in terms of "long-run" disability, these instruments would not appear to provide the level of quantitative information presumably desired or required.

Assessing and diagnosing individuals with suspected speech and language disorders is a complex and multifaceted process; it requires a multidisciplinary assessment and a wide variety of tools. Among these are some of the instruments we have evaluated here. Other standardized tests and assessments similar to the ones we examined might also pass muster. Nonetheless, we caution that they can provide only one part of an appropriate disability (or diagnostic) evaluation. Whether they are the most important part remains an empirical question beyond the scope of this systematic review.

Chapter 5. Future Research Directions

Previous chapters have documented the evidence (or lack of it) on reliability, validity, predictive validity, and other characteristics of 18 instruments judged at the outset to be of high priority for examination in this systematic evidence report done on behalf of the Social Security Administration (SSA). Considering the availability and quality of evidence on these instruments, we have identified several important areas for future research to inform the key questions posed by the SSA, to assist clinicians in making appropriate diagnoses and determinations about disability from speech and language disorders in both adults and children, and, more generally, to advance the field of speech and language disorders.

We first present our recommendations about research concerning the psychometric properties of all instruments. Those observations are followed by discussions of a variety of broader research issues related to applications in clinical practice and settings, to subpopulations, to the diagnosis of disorder versus determination of disability, and to the ability of speech and language assessment instruments to predict future performance.

Such empirical, clinical, and methodologic work is appropriate for several federal agencies that support clinical and health services research. That is, such investigations clearly could, and in our estimation should, extend beyond SSA and the Agency for Healthcare Research and Quality (AHRQ) to, for instance, various institutes of the National Institutes of Health, the Maternal and Child Health Bureau in the Health Resources and Services Administration (and its program on children with special health care needs), the National Institute on Disability and Rehabilitation Research (Department of Education), and possibly the Department of Veterans Affairs. In addition, we hope that recommendations for future research reach and prompt activity on the part of the professional and patient advocacy communities appropriate to speech and language disorders.

Psychometric Properties of Instruments

Inherent Quality of Speech and Language Instruments

Basic Measurement Properties

One critical direction for future research relates to the psychometric requirements of reliable, valid assessment instruments. As discussed throughout this report, basic measurement attributes for any diagnostic test or instrument have often not been addressed, let alone documented or met, for these speech and language disorder instruments. We had couched our review in terms of reliability, validity, and predictive validity; in other domains of health care, such attributes might be addressed in sensitivity, specificity, and false-positive or false-negative terms, and these concepts may be more familiar to clinicians. The bottom line for this evidence report, however, is that very few data document these core requirements of instruments or diagnostic tools, however conceptualized. When such data are unavailable for a given instrument, clinicians, policymakers, or patients can have less than full confidence in findings derived from it.
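For readers more familiar with the diagnostic-test vocabulary mentioned above, the following minimal sketch shows how sensitivity, specificity, and the false-negative and false-positive rates follow from a two-by-two classification table. The function name and the counts are hypothetical and serve only as illustration; they are not data from any instrument reviewed in this report.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Summarize a 2x2 table of test results against true disorder status.

    tp: disordered persons the test flags     fn: disordered persons it misses
    fp: non-disordered persons it flags       tn: non-disordered persons it clears
    """
    sensitivity = tp / (tp + fn)  # share of true cases detected
    specificity = tn / (tn + fp)  # share of non-cases correctly cleared
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_negative_rate": 1 - sensitivity,
        "false_positive_rate": 1 - specificity,
    }


# Hypothetical counts, for illustration only.
result = diagnostic_accuracy(tp=40, fp=10, fn=10, tn=90)
print(result)
```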

For any assessment instrument developed to assess speech or language disorders, therefore, developers need to provide documentation of all types of instrument reliability (internal consistency, test-retest or intra-rater, and inter-rater reliability) and validity (content, construct, and concurrent validity).28,30 Moreover, they should use currently accepted statistical procedures for psychometric data analyses. In addition, normative samples need to be representative of the population(s) of interest and of sufficient size that instruments can be shown to provide valid, interpretable results.28
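As one illustration of the internal consistency documentation called for here, the sketch below computes Cronbach's coefficient alpha from item-level scores using the standard formula, alpha = k/(k-1) x (1 - sum of item variances / variance of total scores). The function name and the score matrix are hypothetical; a real analysis would use an instrument's actual item data and would typically report the other reliability coefficients as well.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's coefficient alpha.

    scores: list of rows, one per examinee; each row lists that
    examinee's score on every item (all rows the same length).
    """
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose to per-item columns
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical scores for four examinees on three items.
example_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
]
print(round(cronbach_alpha(example_scores), 3))
```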

In short, for most of the currently available instruments, the minimal need is for further documentation of reliability and validity, followed by development of sound normative data. If they are to continue in general clinical use, many of the instruments we reviewed will require revisions and validation to bring their psychometric properties up to acceptable standards.

Logistically, greater collaboration between psychometricians and clinical experts in speech and language disorders during all phases of the instrument development process would yield more instruments that not only incorporate critical content for assessment but also meet the psychometric requirements for good assessment instruments. However, improving the psychometric soundness of instruments available for the evaluation of speech and language disorders requires more than the attention of instrument developers to these concerns. It also requires considerable financial resources; clearly, many of the currently available speech and language assessment instruments have been developed to meet an identified clinical need, with minimal financial support. This has led to a proliferation of instruments, but a paucity of ones with demonstrated reliability, validity, and generalizability. Whether future instruments for assessing speech and language functioning are developed under the auspices of public funding or private sponsorship, the need remains to commit sufficient resources to allow for psychometrically sound products. Models for developing such instruments through an evidence-based development process can be found through reference to the psychological literature pertaining, for example, to the development of the Wechsler scales measuring general intelligence in children and adults.

Ceiling Effects

An additional issue related primarily to adult speech and language assessment instruments is the assumption (for most of these instruments) that adults who are functioning normally would perform at a ceiling level if tested on the instruments. Evidence related to this assumption is available only for the Boston Diagnostic Aphasia Examination, 2nd Edition.43 Research verifying the correctness of assumptions regarding normal speech and language performance of adults would contribute to confidence that instruments standardized on individuals with known disorders are not overidentifying speech and language impairments (i.e., leading to false-positive diagnoses and inappropriate labeling of patients).

Publication and Documentation

In the spirit of evidence-based practice, we note that publishing the information called for above in peer-reviewed journals is also critical. As noted in Chapters 2 and 4 of this report, we relaxed what might be regarded as the commonly followed standards of systematic reviews by including substantial amounts of gray literature in the form of users' manuals and instruction guides. That is, we learned in doing this evidence report that data on the reliability and validity of speech and language assessment instruments are rarely found in the peer-reviewed literature; such information is largely confined to the instrument manuals, which do not undergo a peer-review process and cannot really be considered acceptable venues for documenting these properties of the instruments. Thus, we suggest that journal editors in fields concerned with speech and language disorders encourage the submission of reports on instrument reliability and validity, identify peer reviewers who are qualified to evaluate the rigor of these types of reports, and then publish such data in their journals.

Normative Data and Samples

Instruments developed to measure speech and language disorders in adults have used samples of individuals with disorders as the normative sample. Unquestionably, finding the numbers of individuals with disorders needed for the development of psychometrically sound assessment instruments is challenging indeed. Nonetheless, developers report almost no evidence regarding the representativeness of their samples used in standardizing these instruments. This issue needs attention in future research.

A related issue is the importance of developing normative data for subgroups and avoiding inappropriate aggregation of subgroups defined by different disease processes, comorbidities, or prognoses. For example, in the standardization of adult language instruments, individuals with traumatic brain injury and those with cerebrovascular accidents have been grouped together. Patterns of cognitive-linguistic functioning following these two types of neurological insults are different, however, and clinicians' ability to interpret assessment results is impeded when normative databases combine data on these two subgroups.

An issue with several measures of child speech and language development reviewed for this report pertains to the systematic exclusion of children with disabilities from the normative sample, resulting in a lack of representativeness of the samples and possible inflation of the numbers of children who would be identified with speech or language disorders using these measures. In short, the representativeness of samples used in instrument development and standardization is an area for attention in future research across the age span and various disorders comprising the focus of this report.

Issues Relating to Clinical Applications

Costs, Benefits, and Harms

Assessing the feasibility of using these instruments in clinical settings was beyond the purview of this evidence report. Consequently, we did not formally evaluate the cost or the burden on patients, clinicians, or administrators of actually applying these instruments. (Chapter 3 and its instrument-specific tables describe the basic steps for using the instruments and comment on usability.) Information on costs and burden to patients and to those in health care delivery settings might well be valuable, however, in helping SSA or clinicians select among otherwise seemingly similar instruments.

In addition, in extreme cases of impairment, using the types of instruments reviewed in this report may be an unnecessary cost, with disability status being readily determined on the basis of gross clinical criteria that do not require fine-grained assessment or quantitative comparison to normative samples. Thus, areas for future research are (a) to compare the relative sensitivity and specificity of different approaches to disability determination for different types and degrees of speech and language impairment and (b) to determine when the relative costs and benefits justify the addition of standardized instruments to the assessment process rather than relying solely on clinical judgments.

Also not dealt with in this report or in the literature on this topic is the matter of the adverse effects or harms of diagnostic testing or disability evaluation. Although clinicians, SSA experts, and others may not judge harms stemming from the use of these instruments to be large, the possibility nonetheless exists that patients may be misdiagnosed or categorized wrongly. Persons with true disease may be missed and, thus, wrongly kept off or struck from SSA disability eligibility rolls; persons identified as disabled who truly are not may suffer emotionally or financially from such mislabeling. In both cases, SSA faces the twin problems of inappropriate persons being covered, or appropriate persons not being covered, within the disability program. We urge that researchers take a broader perspective on the investigation of speech and language instruments, so as to shed some light on the likelihoods that adults or children may be mislabeled (in both positive and negative ways) and on the consequences of such labeling.
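The likelihood of mislabeling depends not only on an instrument's accuracy but also on how common the disorder is among those tested. The sketch below, offered only as an illustration under assumed values, applies Bayes' rule to obtain positive and negative predictive values from sensitivity, specificity, and prevalence. The function name and all numbers are hypothetical and do not describe any reviewed instrument.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population.

    All three arguments are proportions between 0 and 1.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)  # chance a flagged person is a true case
    npv = true_neg / (true_neg + false_neg)  # chance a cleared person is truly unaffected
    return ppv, npv


# Same hypothetical test accuracy, two hypothetical prevalence levels.
for prevalence in (0.5, 0.1):
    ppv, npv = predictive_values(0.9, 0.9, prevalence)
    print(prevalence, round(ppv, 3), round(npv, 3))
```

Under these assumed values, the same instrument produces proportionally more false-positive labels when the disorder is less common among those tested.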

Normative Data and Subpopulations

With the increasing cultural, linguistic, and racial diversity of the US population, the applicability of assessment instruments to individuals who are members of different subpopulations is of crucial importance to clinical diagnosis and the process of disability determination. In addition to demographic subpopulations, the applicability of speech and language assessment instruments for reliable and valid assessment of individuals with different disorders is important, because speech and language impairments may contribute to the disability of people identified with other disorders, such as severe physical impairment, mental retardation, learning disorders, and hearing impairment. Inclusion of representative numbers of subpopulation members in normative samples during instrument standardization is important, but it is not sufficient to ensure the applicability of instruments to various subgroups.

Improving the evidence base requires analyses examining reliability and validity of instruments for subpopulations, not just for the total normative sample. The only instruments we reviewed that have provided good quality evidence in this respect were the Test of Language Development (TOLD)-Primary, 3rd Edition and the Test of Language Development-Intermediate, 3rd Edition. Subgroups for which TOLD information can be found include children who are speech-language disordered, learning disabled, mentally retarded, and hard of hearing. All three adult language instruments were standardized on individuals who have language and cognitive impairments associated with strokes and traumatic brain injury, but as noted earlier, the data from these two groups were combined and analyses for subgroups were not presented.

Despite this start on instrumentation for various subgroups, clinicians and policymakers need to recognize that dialect, language, or cultural differences, or functional differences due to certain types of impairment, may well preclude reliable and valid assessment with existing instruments. Despite the existence of a large number of speech and language assessment instruments, we still lack appropriate instruments for reliably and validly assessing speech and language in many of these diverse subgroups. Thus, future research funding and priorities should be directed at addressing these serious deficiencies. Funding sources should encourage research teams that represent collaborations among professionals with expertise in speech and language disorders, cultural experts for the demographic subpopulations of interest, professionals with expertise in disorders that often co-occur with speech and language impairment, and psychometric experts.

Evaluating Existing and Emerging Therapies and Treatments

Some work in the evidence-based practice field (for instance, that done on behalf of the current US Preventive Services Task Force120) tends to examine clinical questions about diagnostic (or screening) tests within a context of effective therapies. We did not take that approach in this evidence report because of the already expansive scope of the assignment; hence, questions concerning the efficacy or effectiveness of various treatments of speech and language disorders have here gone unexamined.

We note, however, that a rich agenda of research remains to be pursued concerning appropriate ways to manage speech, language, or voice disorders in both adults and children. A necessary part of such investigations involves tracking patients' progress over time, and obviously the types of instruments reviewed here could play a part in such outcomes assessments.

We caution, however, that the deficiencies in many of these popular and well-known instruments must be addressed before they can be used with confidence in treatment trials or studies. The basic measurement issues were discussed above, but in addition, methodologic work needs to be done on the responsiveness of these instruments (that is, on their sensitivity to change and on the calculation of appropriate effect sizes that reflect change over time for individuals and groups). One strategy for those engaging in or supporting research on the management of patients with speech and language disorders is to build solid methodologic research directly into treatment and rehabilitation studies, thereby strengthening both the given studies and the measurement field as a whole.
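To make the notion of responsiveness concrete, the sketch below computes two commonly used indices of change for a group measured before and after treatment: an effect size (mean change divided by the standard deviation of baseline scores) and a standardized response mean (mean change divided by the standard deviation of the change scores). These are offered as examples of such indices, not as the methods this report prescribes; the function name and scores are hypothetical.

```python
from statistics import mean, stdev


def responsiveness(baseline, followup):
    """Group-level indices of sensitivity to change.

    baseline, followup: paired score lists for the same persons,
    in the same order.
    """
    changes = [after - before for before, after in zip(baseline, followup)]
    return {
        # Mean change scaled by the spread of scores at baseline.
        "effect_size": mean(changes) / stdev(baseline),
        # Mean change scaled by the spread of the change scores themselves.
        "standardized_response_mean": mean(changes) / stdev(changes),
    }


# Hypothetical pre- and post-treatment scores for five patients.
pre = [42, 50, 38, 47, 55]
post = [48, 53, 45, 49, 60]
print(responsiveness(pre, post))
```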

Research on treatment efficacy or effectiveness is also relevant to the concerns of the SSA in disability determination. One factor that must be considered in developing a prognosis for a patient's functioning at the end of a 12-month (or longer) interval is the expected response of the patient to intervention. Thus, building the evidence base on treatment outcomes in speech and language disorders will contribute to policy development and more informed clinical decision-making in this area.

Issues Related to Clinical and Health Policymaking

Impairment and Disability

Most of the instruments we reviewed were designed to provide a measure of the type or degree of impairment (or both) that an individual experiences. The literature has a dearth of information on the relationship between the type or degree of impairment and the functioning of the person in any usual life activities, including those of concern to the SSA in its disability determinations. This "real world" functioning question suggests a rich research agenda that would not only assist the SSA in decisions about disability but also contribute to the "ecological validity" of all speech and language assessments. We need both more instruments providing direct measurement of activity limitations and participation restrictions and more research demonstrating the relationship between speech and language impairment and activity limitations or participation restrictions.

This is not a completely unexplored area in speech and language instrumentation. For example, the Functional Assessment of Communication Skills for Adults was developed to evaluate the ability to communicate in daily life activities despite existing speech, language, cognitive or hearing impairment.24 Our national panel of experts chose instruments based in part on the likelihood of finding an evidence base in the peer-reviewed literature. Thus, the bias toward instruments measuring impairment in the current report reflects the fact that expert measures of speech and language functioning are very limited. The interests of current policymakers and clinical practitioners may be better served by investing resources into refining such measures and developing an evidence base for them rather than or in addition to focusing on measurement of impairment.

Prediction of Future Functioning

We do call for further research on the second key question considered in this report, that is, the ability of speech and language assessment instruments to predict future functioning or performance. As discussed in Chapters 3 and 4, we found very limited evidence in the literature related to this topic. Predicting future functioning is a key criterion for disability determination, so the SSA will need research results that document the ability of these measures to provide robust, longer-term predictions. Said another way, evidence must be relevant to predicting how an individual will function at least one year from the time of the initial disability determination.

Research on this question needs to be large scale. Moreover, it should not be limited to the predictive value of instruments in assessing specific intervention programs or in predicting future performance of a restricted subgroup. Rather, in terms of broad national concerns about disability, research should consider prediction of future test performance and future adaptive performance in everyday life among key subpopulations reflecting age, language, co-existing conditions, and other factors.

Costs and Cost-Effectiveness

Important future research in this area includes investigation of the societal costs of speech and language disorders and the societal benefits of treating them. A good deal of work is needed simply on amassing data on costs of illness and costs of treatment. Combined with better information on efficacy and effectiveness of treatment, as called for above, such information then lays the groundwork for researchers, clinicians, and policymakers to understand better the cost-effectiveness of alternative therapeutic modalities. We are not sanguine that the field could move to pure cost-benefit or cost-utility studies any time soon, but such investigations might be placed on a more distant research agenda.

In short, improving the evidence base on disability associated with speech and language disorders could contribute to the development of more meaningful goals and outcome measures in treatment. It could also facilitate better systemic decisionmaking by policymakers and third party payers, among others. Taking the necessary actions to improve the instrumentation by which such disorders are assessed and diagnosed, as identified throughout this evidence report, is a critical first step.

Glossary for Evidence Tables

Glossary of Abbreviations Other than the Names of Instruments

Abbreviation | Full Name
% agr | percent agreement
% ile | percentile
AA | African American
acct(s) | account(s)
AC | auditory comprehension (PLS only)
ADHD | attention deficit/hyperactivity disorder
ADL | Activities of Daily Living
admin | administered
ADSD | Abductor Spasmodic Dysphonia
addl | additional
agr | agreement
alpha | Cronbach's coefficient alpha
Amer | American
analy | analysis, analytic
ANOVA | analysis of variance
artic | articulation
As | Asian
assess | assessment
assoc | association
ASHA | American Speech-Language-Hearing Association
AQ | Aphasia Quotient (WAB)
b/c | because
betw | between
C | clinician's rating for quality score
CI | confidence interval
class | classification
clin | clinical
cm | centimeter
cntl | control
coeff(s) | coefficients
coh | cohort
comp | comparison
corr(s) | correlation(s)
CPS | Current Population Survey
CQ | Cortical Quotient (WAB)
CVA | Cerebral vascular accident
d(s) | day(s)
decr | decreasing/decreases
defn | definition
Dept | department
diff(s) | difference(s)/different
dis | disorder/disease
dx | diagnosis
EC | Expressive Communication (PLS only)
educ | education/educational
Engl | English
ENT | otolaryngologist (ear nose throat specialist)
eval | evaluation
exp | experienced
express | expressive
expt | experiment, experimental
F | female
fa | factor analysis
fin | final
frag(s) | fragment(s)
fxn | function
grp(s) | group(s)
h(s) | hour(s)
hx | history
IADL | Instrumental activities of daily living
ICC | Intraclass correlation
ICR | Internal consistency reliability
ident | identification
impair | impairment
improv | improvement
incl | including
incr | increases/increasing
inexp | inexperienced
info | information
init | initial
inj | injury
instr(s) | instrument(s)
insuff | insufficient
int'l | international
InterR | inter-rater reliability
IntraR | intra-rater reliability
kappa | Cohen's kappa statistic
K-R stat | Kuder-Richardson statistic
lang | language
LD | learning disabled
learn disability | learning disability
LH | left hemisphere
LQ | Language Quotient (WAB)
M (quality score) | methodologist's ranking for Quality Score
M | male
m(s) | month(s)
max | maximum
mdl | middle
mdn | median
ment retard | mentally retarded/mental retardation
min | minimum
Mn | mean
msec | milliseconds
NA | not applicable
NE | New England
neg | negative
neuro | neurological/neurologically
non-AA | non-African American
non-As | non-Asian
norm(s) | normative(s)
NPV | negative predictive value
NR | not reported
NS | nonsignificant
orig | original
p-tic | pediatric
PC | principal components
PCA | principal components analysis
pgm | program
phono | phonological
phys | physical
popn | population
pos | positive
PPV | positive predictive value
pred | predictor
presch | preschool
prob | probability
propn | proportion
prosp | prospective
pts | patients
qcations | qualifications
RCT | randomized controlled trial
recept | receptive
ref | reference
rehab | rehabilitation
rel | reliability
RH | right hemisphere
SD | standard deviation
sec | seconds
SEM | standard error of the mean
sens | sensitivity
sig | significant
SL | speech language
SLD | speech language disorder
SLP | speech language pathologist
sp | speech
Span | Spanish
spec | specificity
stat(s) | statistically/statistics
std | standardization
stdized | standardized
subj(s) | subject(s)
suff | sufficient
surg | surgery
symp | symptoms
T-RR | test-retest reliability
tx | treatment
undstd | understand/understanding
unk | unknown
untx | untreated
VA | voice analog
VA | verbal ability (PLS only)
VAS | Visual Analog Scale
w/ | with
w/n | within
w/o | without
w(s) | week(s)
wrt | with respect to
y | year
y(s)o | year(s) old

Instrument Abbreviations and Full Names

Abbreviation | Full Name
AAPS | Arizona Articulation Proficiency Scale
AIDS | Assessment of Intelligibility in Dysarthric Speech
BDAE | Boston Diagnostic Aphasia Examination
BLT-2 | Bankson Language Test, Second Edition
CELF | Clinical Evaluation of Language Fundamentals
CCSA | Comprehensive Scales of Student Abilities
DAS | Differential Abilities Scales
DEB | Dysarthria Examination Battery
GFTA | Goldman-Fristoe Test of Articulation
GRBAS | Grade/Rough/Breath/Asthenic/Strain
KTEA | Kaufman Test of Educational Achievement
MDVP | Multi-Dimensional Voice Program
NCCEA | Neurosensory Center Comprehensive Examination of Aphasia
PICA | Porch Index of Communicative Ability
PLS | Preschool Language Scale
PPVT-R | Peabody Picture Vocabulary Test-Revised Edition
RCPM | Raven's Coloured Progressive Matrices
SF-36 | Medical Outcomes Study, Short Form-36
SSI | Stuttering Severity Index
TELD | Test of Early Language Development
TOAL-3 | Test of Adolescent and Adult Language, Third Edition
TOLD | Test of Language Development
TONI | Test of Nonverbal Intelligence
TOPL | Test of Pragmatic Language
TOWL-2 | Test of Written Language, Second Edition
TWS-3 | Test of Written Spelling, Third Edition
VHI | Voice Handicap Index
WAB | Western Aphasia Battery
WIAT | Wechsler Individual Achievement Test
WISC-III | Wechsler Intelligence Scale for Children, Third Edition
WPPIS-R | Wechsler Preschool and Primary Intelligence Scale, Revised
WRMT-R | Woodcock Reading Mastery Tests, Revised

Evidence Tables

Appendix A: Acknowledgments

This study was supported by Contract No. 290-97-0011 Task No. 8 from the Agency for Healthcare Research and Quality (AHRQ). We acknowledge the assistance of Jacqueline Besteman, J.D., M.A., the AHRQ Task Order Officer for the Evidence-Based Practice Center Program, and Ernestine (Tina) Murray, R.N., M.A., the AHRQ Task Order Officer for this task. We also appreciate the cooperation and support of Sandra Z. Salan, M.D., the Social Security Administration liaison with AHRQ.

The Technical Expert Advisory Group (TEAG; listed in Appendix B) played an integral and active role in shaping and producing the report. The following individuals from the SSA provided input to the investigators and the TEAG on the development of key clinical questions and the selection of the instruments reviewed: Janet Bendann; E. Lucinda Cassett-James, Ph.D; Paul Burgan, M.D.; Marquita Rand, Ph.D.; Roberta A. Schulman, Ph.D.; and Frank Schuster, M.D. We also appreciate the additional guidance provided by the 10 peer reviewers listed in Appendix C.

The investigators appreciate the time and efforts of our data abstractors. The methodologic abstractors were Michael Edwards, B.A., and Kelly Kandra, B.A.; Anna Norwood, M.S., and Misty Raasch, M.S., evaluated the clinical usability of the instruments. Lynn Whitener, Dr.P.H., M.S.L.S., provided electronic search expertise for the data collection phase, and Joy R. Harris, B.A., and Christian J. Setzer, B.A., provided fundamental assistance during the data collection and abstraction phases of the project. Christopher J. Wiesen, Ph.D., and Carla M. Bann, Ph.D., provided invaluable assistance in the development of data abstraction forms and the psychometrically based scheme for grading the evidence in this report. We are greatly indebted to Anne Jackman, M.S.W., for her capable project management and to Timothy S. Carey, M.D., M.P.H., Co-Director of the RTI-UNC Evidence-based Practice Center, for his support.

Finally, we are grateful to Jessica Nelson, B.A., for her project management support; to Philib Salib, B.A., and Linda Lux, M.P.A., for their guidance and assistance in preparing this report; to Loraine Monroe for her outstanding word processing support; and to Richard Strowd, J.D., for exceptional contracting support.

Appendix B: Technical Expert Advisory Group (TEAG)

Our multidisciplinary technical expert advisory group (TEAG) was composed of 10 individuals (Table B.1) who provided expertise in: (1) clinical areas including speech, language, and voice disorders, neurology and neuropsychology (in both children and adults), otolaryngology, and developmental pediatrics; (2) professional societies and organizations; (3) development, validation, and use of psychometric tests and other instruments; and (4) likely users of this evidence report such as the Social Security Administration (SSA) and other entities interested in the appropriate evaluation of speech and learning disorders. Nominations for potential TEAG members were made by colleagues in the Division of Speech and Hearing Sciences at the University of North Carolina (UNC) at Chapel Hill, the SSA liaison to the project, and the Agency for Healthcare Research and Quality (AHRQ); other names emerged from a basic search of the literature.

Staff of the RTI-UNC Evidence-based Practice Center (EPC) contacted nominees by telephone and electronic mail between June and August 2000, described the project and the types of activities and responsibilities that a TEAG member would undertake during the project period, gave an estimate of the time required to complete activities, and inquired as to the person's interest and willingness to serve on the TEAG. Six nominees declined to participate, citing lack of time to complete the required tasks or inability to attend the one-day meeting; all expressed interest and support for the project. Names of other experts were solicited from individuals who declined to participate; ultimately, three TEAG members were recruited through this process.

TEAG responsibilities included: (1) attendance at a one-day meeting to assist in the development or modification of the key clinical questions and causal pathway and the selection and prioritization of instruments for inclusion in the evidence review; (2) attendance at conference calls to discuss project progress and to address difficulties encountered by the RTI-UNC EPC team; and (3) general availability by telephone and electronic mail to address questions as they arose. TEAG members each received a $1,000 honorarium for their participation. Additionally, all TEAG members were invited to serve as peer reviewers of the draft evidence report; TEAG members did not receive compensation for their peer review.

Table B.1. Members of Technical Expert Advisory Group (TEAG)
Name | Training/Expertise | Organization | Location
Pasquale J. Accardo, M.D. | Developmental Pediatrician | Westchester Institute for Human Development | Valhalla, NY
Michael S. Benninger, M.D., F.A.C.S. | Otolaryngologist-Chair, Dept of Otolaryngology-Head and Neck Surgery | Henry Ford Hospital | Detroit, MI
Edward G. Conture, Ph.D. | Speech/Language Pathologist with expertise in child speech disorders and fluency | Vanderbilt University, Bill Wilkerson Center | Nashville, TN
H. Branch Coslett, M.D. | Adult/Behavioral Neurologist | University of Pennsylvania School of Medicine | Philadelphia, PA
Paul J. Eslinger, Ph.D. | Clinical Neuropsychologist | Pennsylvania State University School of Medicine | Hershey, PA
Malcolm R. McNeil, Ph.D. | Speech/Language Pathologist with expertise in adult language disorders | University of Pittsburgh School of Medicine | Pittsburgh, PA
Rhea Paul, Ph.D. | Speech/Language Pathologist with expertise in child language disorders | Southern Connecticut State University | New Haven, CT
Diane Paul-Brown, Ph.D. | Consumer issues | American Speech-Language-Hearing Association | Rockville, MD
Robert F. Rust, Jr., M.A., M.D. | Pediatric Neurologist | University of Virginia School of Medicine | Charlottesville, VA
Kathryn M. Yorkston, Ph.D. | Speech/Language Pathologist with expertise in adult speech disorders | University of Washington | Seattle, WA

Appendix C: Peer Review

Broad peer review of draft evidence reports is critical to the validity and ultimate acceptability of systematic reviews from the evidence-based practice centers (EPC) of the Agency for Healthcare Research and Quality (AHRQ) and thus to the credibility of the AHRQ Evidence-based Practice Program more generally. It was equally important to satisfy the longer-term goals of the Social Security Administration (SSA) in this area of research and specifically for the evidence report on Criteria for Determining Disability in Speech and Language Disorders.

Thus, we requested review by 20 experts representing: (1) clinical areas including speech, language, and voice disorders, neurology and neuropsychology (both in children and adults), otolaryngology, and developmental pediatrics; (2) professional societies and organizations; (3) development, validation, and use of psychometric tests and other instruments; and (4) likely users of this EPC report, such as the SSA and other entities interested in the appropriate evaluation of speech and learning disorders (Table C1). We invited all 10 TEAG members (Appendix B) to provide peer review of the evidence report, a responsibility that exceeded their initial mandate.

On March 21, 2001, we sent letters to the 20 nominated experts and organizations explaining the peer review process and asking if they would be available to conduct the review in July and to return the comments by early August. Two experts declined our invitation to review the report because of time constraints and professional responsibilities. We could not reach the director of one organization nominated for review despite repeated attempts to contact her through September 2001. Unanticipated project delays led to a six-week delay in our delivery of the report to peer reviewers, and we thus sent the report to them on August 23, 2001, with a request for comments to be returned to us by September 21, 2001. Our review period coincided with the tragic national events of September 11, and we consequently extended our deadline for reviews until October 10, 2001. Of the 17 reviewers who agreed to perform the review, seven (four nominees from organizations and three individual experts) were unable to return reviews because of time constraints. Ultimately, 10 reviewers returned reviews. In addition, staff from both AHRQ and SSA reviewed the document and returned comments.

Peer reviewers provided comments on the content, structure, and format of the evidence report, paying particular attention to the inclusion/exclusion of literature for the selected tests, to the analysis and interpretation of study results and evidence, and to the discussion of gaps and areas that should be targeted for future research. EPC staff took all comments into account in revisions to this evidence report; these are documented in a separate document recording the disposition of all peer review suggestions that was submitted to AHRQ late in 2001.

Table C.1. Peer Reviewers
Name | Degrees | Affiliation
Pasquale Accardo | M.D. | TEAG member
Susan G. Allen | M.E.D., M.Ed., C.C.C.-S.L.P., C.E.D. | Alexander Graham Bell Association for the Deaf
Michael Benninger | M.D., F.A.C.S. | TEAG member
Jane Blalock | M.D. | Learning Disabilities Association of America
Carl A. Coelho | Ph.D. | National Aphasia Association
Paul J. Eslinger | Ph.D. | TEAG member
Malcolm McNeil | Ph.D. | TEAG member
Rhea Paul | Ph.D. | TEAG member
Diane Paul-Brown | Ph.D. | TEAG member
Kathryn M. Yorkston | Ph.D. | TEAG member

Appendix D: Methodology

This appendix provides additional detail on selected aspects of the methodological approach adopted by the RTI-University of North Carolina at Chapel Hill (UNC) Evidence-based Practice Center (RTI-UNC EPC). We first discuss the process we used to modify the key clinical questions and the causal pathway. We then document how we selected instruments and set priorities for literature search and evidence review. We end the appendix with a discussion of the supplemental analysis of the usability of the selected instruments and their manuals. Tables D1 through D4 and Figures D1 through D6 presented here supplement the text in Chapter 2, its Methods Appendix, and this Appendix.

Revision of Key Clinical Questions and Causal Pathway

We developed preliminary key clinical questions and a preliminary causal pathway in response to the initial request for proposal from the Social Security Administration (SSA) and the Agency for Healthcare Research and Quality (AHRQ). To refine these conceptual issues, we organized a one-day meeting (September 18, 2000, in Rockville, Maryland) to

  • Solicit input from the meeting participants on the utility and appropriateness of the causal pathway and refinement of key clinical questions, and

  • Identify and prioritize evaluation tools to be included in the evidence analysis.

Table D1. Participants in September 18, 2000 Meeting

Research Triangle Institute-University of North Carolina (RTI-UNC) Evidence-based Practice Center
Kathleen Lohr, Ph.D. | Co-Director, RTI-UNC EPC | Research Triangle Institute | RTP, NC
Andrea Biddle, Ph.D., M.P.H. | Study Director | University of North Carolina | Chapel Hill, NC
Linda Watson, Ed.D. | Scientific Director | University of North Carolina | Chapel Hill, NC
Jessica Nelson, B.A. | Project Manager | Research Triangle Institute | RTP, NC

Agency for Healthcare Research and Quality (AHRQ)
Jacqueline Besteman, J.D., M.A. | EPC Program Officer | Center for Practice & Technology Assessment, Agency for Healthcare Research and Quality | Rockville, MD
Ernestine Murray, R.N., M.A.S. | Project Officer | Center for Practice & Technology Assessment, Agency for Healthcare Research and Quality | Rockville, MD

Social Security Administration (SSA)
Paul Burgan, M.D. | Medical Officer-Pediatrics | Office of Disability, Division of Medical and Vocational Policy | Baltimore, MD
Sandra Z. Salan, M.D. | Medical Officer-Neurology | Office of Disability, Division of Medical and Vocational Policy | Baltimore, MD
Frank Schuster, M.D. | Medical Officer-Pediatric Neurology | Office of Disability, Division of Medical and Vocational Policy | Baltimore, MD
Marquita Rand, Ph.D. | Speech and Language Pathologist, Program Analyst | Office of Disability, Division of Medical and Vocational Policy | Baltimore, MD
Janet Bendann | Program Analyst | Childhood Disability Branch, Division of Medical and Vocational Policy | Baltimore, MD
Roberta A. Schulman, Ph.D. | Speech/Language Pathologist | Federal Disability Determination Service | Baltimore, MD
E. Lucinda Cassett-James, Ph.D. | Speech/Language Pathologist | Federal Disability Determination Service | Baltimore, MD

Technical Expert Advisory Group (TEAG) Members
Robert F. Rust, Jr., M.D. | Pediatric Neurologist | University of Virginia School of Medicine | Charlottesville, VA
Michael Benninger, M.D., F.A.C.S. | Otolaryngologist | Henry Ford Hospital | Detroit, MI
H. Branch Coslett, M.D. | Adult/Behavioral Neurologist | Temple University School of Medicine | Philadelphia, PA
Paul Eslinger, Ph.D. | Clinical Neuropsychologist | Pennsylvania State University School of Medicine | Hershey, PA
Diane Paul-Brown, Ph.D. | Representative, Consumer Issues | ASHA | Rockville, MD
Rhea Paul, Ph.D. | Speech/Language Pathologist | Southern Connecticut State University | New Haven, CT
Edward Conture, Ph.D. | Speech/Language Pathologist | Vanderbilt Bill Wilkerson Center | Nashville, TN
Malcolm McNeil, Ph.D. | Speech/Language Pathologist | University of Pittsburgh | Pittsburgh, PA
Kathryn Yorkston, Ph.D. | Speech Language Pathologist | University of Washington | Seattle, WA

Meeting participants included the 10 members of our formal Technical Expert Advisory Group (TEAG) (see Appendix B), AHRQ staff, and appropriate SSA representatives. The meeting participants (Table D1) included individuals with clinical expertise in speech, language, and voice disorders in adults and children, neurology (adult and pediatric), otolaryngology, developmental pediatrics, and both educational and vocational aspects of speech and language disorders; representatives of professional societies (e.g., American Psychological Association, American Academy of Otolaryngologists-Head and Neck Surgeons, American Speech-Language-Hearing Association) and health care systems also participated.

Key Clinical Questions

During the meeting the discussion of the key clinical questions concerned (1) whether language impairment included oral, aural, and written impairments; (2) which populations to include or exclude; and (3) differences between impairment and function and whether it is appropriate to use tools for impairment to evaluate future functioning or performance. The first issue was addressed by leaving the terminology in the key clinical questions broad (i.e., speech and language disorders) rather than by limiting to spoken language (current SSA criteria examine spoken or verbal rather than written language). The second issue was not resolved; rather, the RTI-UNC EPC evaluated tools looking for evidence for all of the populations listed. The final issue of using measures of impairment to predict future functioning was addressed by rewording the second key clinical question to ensure that measures of impairment are used to predict future impairment and that functioning/performance tools are used to predict future functioning/performance.

Causal Pathway

During the meeting, participants suggested minor revisions to the causal pathway. Specifically, we included voice impairments as a separate category of impairments and added responsiveness (i.e., the ability of the test to detect changes in impairment and to differentiate between levels of severity) and appropriate normative data to the relevant test characteristics of the evaluation tools. We made the former change because voice production is a separate issue with different evaluation tools and complexities. The latter change was made to describe more comprehensively the characteristics of evaluation instruments.

Selection and Prioritization of Instruments for Review

During the September 18, 2000 meeting, we asked participants to select and prioritize instruments for review. The scientific director provided meeting participants with a partial list of speech-language diagnostic tools to use as a reference during this process. The list was not exhaustive but rather was designed to serve as a trigger for suggesting tools. After some discussion of the SSA's inability to accept tests for which normative data are not current, the participants were reminded that the purpose of the task was to select tools for which evidence is available and to allow the SSA to use the resulting evidence report to develop criteria.

The selection process began with the EPC study director and Center co-director encouraging meeting participants to nominate evaluation tools in each of five categories -- adult language, child language, adult speech, child speech, and voice. They then solicited a single tool from each individual, going around the table until no participants made additional suggestions. In general, TEAG members suggested the majority of the evaluation tools. Meeting participants generally suggested only tools in their areas of expertise; physicians were less likely to contribute tools during this process.

In all, 39 separate instruments emerged. We could not have conducted systematic literature reviews and evidence analyses for each of the 39 tools elicited given the project timeline and resources, so the EPC co-director asked meeting participants to set priorities for the tools within the five categories, selecting three tools in each. A formal voting process was not used to elicit the priorities, but the participants substantially agreed about the tools finally selected for review.

During the prioritization process, meeting participants articulated several principles for guiding instrument selection. For language disorders, participants suggested that tools needed to represent receptive language, expressive language, and functional language, with emphasis on tools that test language disorders broadly rather than a particular aspect of language. The panel did not suggest several instruments considered to be standards in the field (e.g., ASHA Functional Assessment of Communication Skills23) because reliability and validity data, although available, evidently had not been published in the peer-reviewed literature. For child language conditions, participants also considered it important to select tools that could be used for different age groups (i.e., 0 to 3 years of age, 3 to 5 years, and school age). For speech, tools evaluating connected speech were given greater consideration than those evaluating only single word production. Meeting participants also attempted to balance tools for elicited behaviors with those evaluating observed behaviors.

Table D2. Tests Selected by Meeting Participants by Disorder Category (a)
Adult Language
Porch Index of Communicative Ability (PICA)
Western Aphasia Battery (WAB)
Boston Diagnostic Aphasia Examination, 2nd Edition (BDAE-2)
Multilingual Aphasia Examination, 3rd Edition/Examen de Afasia Multilingue (Spanish edition)
Communicative Abilities for Daily Living, 2nd Edition
Expressive Vocabulary Test
Revised Token Test
Discourse comprehension lists
Correct Information Units
Information Units
 
Child Language
Clinical Evaluation of Language Fundamentals, 3rd Edition (CELF-3)
CELF-3 Spanish Edition (CELF-3Sp)
CELF-Preschool (CELF-P)
Test of Language Development-Primary, 3rd Edition (TOLD-P:3)
Test of Language Development-Intermediate, 3rd Edition (TOLD-I:3)
Preschool Language Scale-3 (English) (PLS-3)
Preschool Language Scale-3 (Spanish) (PLS-3Sp)
Test of Pragmatic Language (TOPL)
Index of Productive Syntax
Pragmatic Protocol
Peabody Picture Vocabulary Test III (PPVT)
Test for Auditory Comprehension of Language, 3rd Edition
Communication and Symbolic Behavior Scales (CSBS)
Vineland Adaptive Behavior Scales
Token Test for Children
Developmental Sentence Scoring (DSS)
Expressive Vocabulary Test
Test of Problem Solving
Receptive-Expressive Emergent Language (REEL)-3
Adult Speech
Stuttering Severity Instrument for Children and Adults, 3rd Edition (SSI-3)
Assessment of Intelligibility in Dysarthric Speech (AIDS)
Dysarthria Examination Battery (DEB)
Frenchay Dysarthria Assessment
Intelligibility ratings
 
Child Speech
Stuttering Severity Instrument for Children and Adults, 3rd Edition (SSI-3)
Goldman-Fristoe Test of Articulation, 2nd Edition (GFTA-2)
Phonological process analysis
Photo Articulation Test-3rd Edition
Stuttering Prediction Instrument for Young Children
Khan-Lewis Phonological Analysis
 
Voice
Voice Handicap Index (VHI)
Kay Elemetrics Multi-Dimensional Voice Program (MDVP)
GRBAS (grade, rough, breathy, asthenic, strain) Scale
(a) Bold, italicized text indicates instruments the meeting participants selected for literature review and evidence analysis based on familiarity with tests, known breadth of use of the instrument, and expected level of available evidence.

As noted, meeting participants suggested 39 instruments during this process -- 10 for adult language, 15 for child language, five for adult speech, six for child speech, and three for voice disorders. Table D2 gives the entire list, with those selected for review indicated in italicized, bold text. More than three tools may be indicated for review if the tools apply (in various forms) to both adults and children or appear in both English and Spanish and would likely be captured in a single literature search. As described in Chapter 2, we subsequently excluded phonological process analysis after consultation with TEAG members in December 2000 and with colleagues in the Division of Speech and Hearing Sciences at UNC-Chapel Hill.

Supplemental Analysis -- Usability Analysis

When deciding which instrument to use, a clinician must evaluate whether the manual provides sufficient information on how to administer and score the instrument. As part of our analyses, we evaluated the usability of the instrument manuals. Two second-year speech and language pathology graduate students in the Division of Speech and Hearing Sciences at the University of North Carolina at Chapel Hill independently evaluated each manual for comprehensiveness and ease of use. Each had completed a minimum of 80 hours of supervised training in speech and language disorder assessment, divided equally between adults and children. Thus, the evaluations of these raters represent what we might expect if an experienced speech and language pathologist used an unfamiliar instrument for the first time.

Each reviewer independently completed the Usability Evaluation Form (Figure D6) supervised by EPC clinical experts. After the evaluations were complete, we entered the data into a Microsoft Excel® spreadsheet, coding a "1" if the rater indicated that the instrument met the criterion (i.e., circled "yes" on the form), "0" otherwise. To assure consistency between the graduate student reviewers, we computed Cohen's kappa statistic of inter-rater reliability35 using SAS, version 6.12 (The SAS Institute, Cary, NC) and percentage agreement between the raters (Table D4). For each instrument, we report the number and percentage of criteria met and describe where the two reviewers disagreed in their assessments. We calculated these statistics by adding the number of criteria met, scoring 1 point for criteria that both reviewers rated as being met and 0.5 points for criteria where only one rated it as being met; and dividing by the total number of criteria (i.e., 8).
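As an illustration of the scoring rule just described, the following Python sketch computes the number and percentage of criteria met for a single instrument from the two raters' 0/1 codes. It is not the procedure used in the report, which relied on Excel and SAS; the function name `usability_score`, the variable names, and the example ratings are hypothetical and do not come from the report's data.

```python
def usability_score(rater_a, rater_b):
    """Return (points, percentage) of usability criteria met for one instrument.

    rater_a, rater_b: lists of 0/1 codes, one entry per criterion
    (1 = the rater circled "yes" on the form, 0 = otherwise).
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same set of criteria")
    # Averaging the two codes yields 1 point when both raters say "met",
    # 0.5 points when only one does, and 0 points when neither does.
    points = sum((a + b) / 2 for a, b in zip(rater_a, rater_b))
    percentage = 100 * points / len(rater_a)
    return points, percentage


# Hypothetical ratings for the 8 usability criteria (illustration only).
example_a = [1, 1, 1, 0, 1, 1, 0, 1]
example_b = [1, 1, 0, 0, 1, 1, 1, 1]
points, pct = usability_score(example_a, example_b)
print(points, pct)  # 6.0 75.0
```

In this made-up example the raters agree that five criteria are met, disagree on two, and agree that one is not met, giving 5 + 2(0.5) = 6 points out of 8, or 75 percent.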

Kappa values for the individual criteria ranged from 0.34 to 1.00, suggesting fair to almost perfect agreement.39 Inter-rater agreement ranged from 76.5 percent (13/17) to 100 percent (17/17). We re-reviewed the criteria for which we observed the lowest level of agreement. In most cases, the disagreement concerned not whether the instrument met the criterion but how much information the manual provided to the reviewer. One of the reviewers seemed to require more detail on use to be comfortable with the instrument and, thus, to rate the criterion as having been met.
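For readers who want to see how the two agreement statistics relate, here is a minimal Python sketch of percentage agreement and Cohen's kappa for one criterion rated on 17 instruments. The report's values were computed in SAS; the function names and the example codes below are hypothetical and were chosen only so that the raters agree on 13 of 17 instruments, matching the lowest agreement level mentioned above.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of items on which the two raters gave the same 0/1 code."""
    agree = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * agree / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning binary (0/1) codes."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement is derived from each rater's marginal share of 1s.
    share_a = sum(rater_a) / n
    share_b = sum(rater_b) / n
    p_chance = share_a * share_b + (1 - share_a) * (1 - share_b)
    if p_chance == 1:
        # Both raters gave the same single code to every item; kappa is undefined.
        return float("nan")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for one criterion across 17 instruments (illustration only).
codes_a = [1] * 10 + [0] * 3 + [1, 1, 0, 0]
codes_b = [1] * 10 + [0] * 3 + [0, 0, 1, 1]
print(round(percent_agreement(codes_a, codes_b), 1))  # 76.5
print(round(cohens_kappa(codes_a, codes_b), 2))       # 0.43
```

The example shows why the two statistics are reported side by side: the same 13/17 agreement can correspond to different kappa values depending on how often each rater codes a criterion as met, because kappa discounts agreement expected by chance.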

Table D3. Search Terms Employed in the Literature Review

Table D4. Percentage Agreement and Inter-rater Reliability (Kappa)

Figure D1. Evaluation Instruments for Speech and Language Disorders Abstract Review Form

Article Author:____________________________________________________________________
Journal:_________________________________________________________________________
Year of Article:____________________________________________________________________
UID (Unique Identifier):______________________________________________________________
Name of Tool:_____________________________________________________________________
Database: Circle one of the following:
MEDLINE  CINAHL  PSYCHLIT  ERIC  HAPI  UNKNOWN
Abstractor Initials:______________________________
If the abstract is not available, stop here and return this form to Anne Jackman.
1. Includes information on: reliability (e.g., internal consistency, test-retest, intra- or inter-rater) AND/OR validity (e.g., construct, concurrent, or predictive validity for future communicative functioning) of evaluation tool(s)
Yes    No    Cannot Determine
2. After completion of study each analysis group is greater than or equal to 20 subjects
Yes    No    Cannot Determine
3. After completion of study each analysis group is greater than or equal to 10 subjects
Yes    No    Cannot Determine
4. Study design is one of the following:
  1. RCT (double, single-blinded or cross-over)
  2. Nonequivalent control group design
  3. Prospective or retrospective cohort
  4. Cohort study not otherwise specified
  5. Case-control study
  6. Psychometric testing of all types
  7. Meta-analysis, meta-regression, or cross-design synthesis
Yes    No    Cannot Determine
5. Includes children (birth-21) and/or adults*
Yes    No    Cannot Determine

* May include older individuals but majority must fall into this age range.

IF ANY ITEM IN THE GRAY AREA IS CIRCLED, THE ARTICLE IS EXCLUDED.
IF ANY ITEM IN "CANNOT DETERMINE" AREA CIRCLED, PULL ARTICLE FOR FURTHER REVIEW (PFFR).
INCLUDE:__________EXCLUDE:___________PFFR:______________
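
The decision rule printed at the foot of the abstract review form can be sketched in Python as follows. This is only an illustration: the `Decision` enum, the `triage` function, and the tuple layout are hypothetical. The shading that marks the gray area is not reproduced in this text, so it is passed in as a flag rather than inferred, and giving exclusion precedence over PFFR when both conditions apply is an assumption the form does not state.

```python
from enum import Enum


class Decision(Enum):
    INCLUDE = "INCLUDE"
    EXCLUDE = "EXCLUDE"
    PFFR = "PFFR"  # pull article for further review


def triage(responses):
    """Classify one abstract from its review-form responses.

    responses: list of (answer, in_gray_area) tuples, one per form item.
    answer is "Yes", "No", or "Cannot Determine"; in_gray_area is True
    when the circled answer falls in the form's shaded (gray) region.
    """
    # Any circled item in the gray area excludes the article
    # (assumed here to take precedence over "Cannot Determine").
    if any(in_gray for _, in_gray in responses):
        return Decision.EXCLUDE
    # Otherwise, any "Cannot Determine" sends the article for full-text review.
    if any(answer == "Cannot Determine" for answer, _ in responses):
        return Decision.PFFR
    return Decision.INCLUDE


# Hypothetical abstract: four clear answers and one undetermined item.
example = [("Yes", False)] * 4 + [("Cannot Determine", False)]
print(triage(example).value)  # PFFR
```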

Figure D2. Data Extraction Form for Peer-Reviewed Articles-Criteria for Determining Disability in Speech Language Disorders

DATA EXTRACTION FORM FOR PEER-REVIEWED ARTICLES CRITERIA FOR DETERMINING DISABILITY IN SPEECH LANGUAGE DISORDERS
SECTION I: Inclusion/Exclusion Checklist
A. Date of Review (MM/DD/2001): ____ / ____ / 2001
B. Tool Used: (specify instrument(s)/version(s) tested first, criterion/reference test next)
Instrument #1: _____________________________________ Version: ___
Instrument #2: _____________________________________ Version: ___
Instrument #3: _____________________________________ Version: ___
Instrument #4: _____________________________________ Version: ___

C. ABSTRACTOR: BEFORE BEGINNING TO ABSTRACT THE ARTICLE, ANSWER THE FOLLOWING QUESTIONS.

1. Includes information on reliability (e.g., internal consistency, test-retest, intra- or inter-rater) AND/OR validity (e.g., construct, criteria, or predictive validity for future communicative functioning) of evaluation tool(s)
Yes    No
2. After completion of study each analysis group is greater than or equal to 20 subjects
Yes    No
3. After completion of study each analysis group is greater than or equal to 10 subjects
Yes    No
4. Study design is one of the following:
  1. RCT (double, single-blinded or cross-over)
  2. Nonequivalent control group design
  3. Prospective or retrospective cohort
  4. Cohort study not otherwise specified
  5. Case-control study
  6. Psychometric testing of all types
  7. Meta-analysis, meta-regression, or cross-design synthesis
Yes    No
5. Includes children (birth-21) and/or adults (18-62)*
Yes    No

* May include older individuals but majority must fall into this age range.

ABSTRACTOR: IF ANY ITEM IN THE GRAY AREA IS CIRCLED, THE ARTICLE IS EXCLUDED. RETURN ARTICLE AND FORM TO STUDY DIRECTOR IMMEDIATELY.
SECTION II: Study Background Information
1. Date of Review (MM/DD/2001): __ __ / __ __ / 2001
2. Tool Used: (specify instrument(s)/version(s) tested first, criterion/reference test next)
        Instrument #1: _____________________________________ Version: ___
        Instrument #2: _____________________________________ Version: ___
        Instrument #3: _____________________________________ Version: ___
        Instrument #4: _____________________________________ Version: ___
3. Population (highlight one):    1=Children <=21 yrs    2=Adults <=62 yrs
4. Disorder (highlight one):    1=Not reported    2=Unclear    3=Language
    4=Speech    5=Voice
    6=Other (specify)_______________________________
5. Year Published (YYYY):___ ___ ___ ___
6. Country where study conducted: __ __
1=United States    2=Canada    3=Britain/United Kingdom
4=Australia    5=New Zealand    6=European Country_______________________
7=South America_____________________    8=Asia______________________________
7. Number of Authors: __ __
8. Surname of First Author_______________________________________
9. Background of First Author (highlight all that apply):
1=Not reported    8=Pediatrics
2=Unclear    9=Neurology
3=Speech Language Pathology    10=Otolaryngology
4=Audiology    11=Education
5=Psychology    12=Test agency/publisher
6=Neuropsychology    13=Other (specify)________________________
7=Psychiatry
10. Funding Source (highlight all that apply):
1=Not reported    5=Industry
2=Unclear    6=Government Agency
3=Consumer/Patient Organization    7=Professional Organization
4=Charity    8=Other (specify)_____________________
SECTION III: Study Design and Description
12. Main Objectives (as described by author, give page and paragraph number in article):
Page:_________Paragraph(s):___________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
13. Study Design (highlight one only):
1=Not reported    6=Prospective cohort (w/out comparison group)
2=Unclear    7=Retrospective cohort (w/ comparison group)
3=Controlled trial, randomized    8=Retrospective cohort (w/out comparison group)
4=Controlled trial, non-randomized    9=Other (specify)_________________
5=Prospective cohort (w/ comparison group)
14. Length of Study:___________________________________________________________
(If follow-up times vary, include the range, mean, median, and/or mode if they are given.)
15. Study/Evaluation Setting (highlight all that apply):
1=Not reported    7=Speech/Language Clinic: Inpatient
2=Unclear    8=Speech/Language Clinic: Outpatient
3=Community    9=Hospital: Inpatient
4=School: Preschool    10=Hospital: Outpatient
5=School: Elementary School    11=Other (specify)____________________
6=School: Middle/High School
Outcome(s) Measured
16. Is outcome measured >= 12 months after initial measurement?
1=Yes
2=No → Go to next question
3=Not reported → Go to next question
4=Not clear → Go to next question

17. Is outcome measured >= 6 months after initial measurement?
1=Yes
2=No → See Study Director immediately
3=Not reported → See Study Director immediately
4=Not clear → See Study Director immediately

18. Outcome Measure(s)
(i.e., what types of functioning/performance, instrument used to measure):
(Abstractor: Add or eliminate as needed)
Outcome #1: __________________________
Instrument/Tool #1: _____________________ Version: _______________
Outcome #2: __________________________
Tool #2: _____________________________ Version: _______________
Outcome #3: __________________________
Tool #3: ______________________________ Version: _______________
Characteristics of Assessors/Evaluators
19. Number of Assessors/Evaluators (if given): __ __
20. Background of Assessors/Evaluators (highlight all that apply):
1=Not reported    7=Psychiatrist
2=Unclear    8=Pediatrician
3=Speech Language Pathologist    9=Neurologist
4=Audiologist    10=Otolaryngologist
5=Psychologist    11=Graduate Student: (specify discipline) → _______________________
6=Neuropsychologist
Characteristics of Patients/Subjects
Abstractor: Add groups as required.

21. Defining Characteristic(s)
(e.g., children with SL impairment vs. normal, age-matched children; Wernicke's vs. Broca's aphasia, etc.):
    Group 1: _______________________________________________________________
    Group 2: _______________________________________________________________
22. Comorbidities (e.g., learning disorders, mental retardation, hearing impairment, etc.):
    Group 1: ____________________________________________________
    Group 2: ____________________________________________________
23. Total Number of Patients/Subjects:
Group 1: # Initially entered study ________ Dropouts ________ Final # _________
Group 2: # Initially entered study ________ Dropouts ________ Final # _________
24. Sex (#):
Group 1: Males ________ Females ________
Group 2: Males ________ Females ________
25. Race/Ethnic Group:
26. Age (Y-M): Mean: ________________ Median: ___________________ SD: ___________________ Range: ___________________
27. Other Demographic Characteristics (Abstractor fill in as needed/appropriate)

28. Inclusion Criteria Reported (Highlight one):
1=Yes    2=No    3=Unclear
Page:_______________Paragraph(s):_______________
(Abstractors: Note differences between groups 1 and 2, if any.)
1.______________________________________________________________________________________
2.______________________________________________________________________________________
3.______________________________________________________________________________________
4.______________________________________________________________________________________
29. Exclusion Criteria Reported (Highlight one):
1=Yes    2=No    3=Unclear
Page:_______________Paragraph(s):_________
(Abstractors: Note differences between groups 1 and 2, if any.)
1.______________________________________________________________________________________
2.______________________________________________________________________________________
3.______________________________________________________________________________________
4.______________________________________________________________________________________
30. Intervention(s) Studied:
(Abstractor describe the therapy or intervention provided to study participants. If none given, write "none" on first line. Add or eliminate lines as needed to describe.)
Page:__________ Paragraph(s):_________
1.______________________________________________________________________________________
2.______________________________________________________________________________________
3.______________________________________________________________________________________
4.______________________________________________________________________________________
SECTION IV: Statistical Analysis
Statistical Methods Employed:
31. What statistics are employed? (Highlight all that apply)
1=Not reported    5=Regression modeling-linear
2=Unclear    6=Logistic regression modeling
3=t-tests, Z scores, ANOVA    6=Factor analysis
4=Correlations    7=Other (specify)________________________________
32. If different statistical methods are used to address the outcomes, describe which methods are used for each outcome:
Outcome #1: ________________________________
Statistical Method #1: _____________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
Outcome #2: ________________________________
Statistical Method #2: _____________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
Outcome #3: ________________________________
Statistical Method #3: _____________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
33. What factors were controlled for in the analyses? (e.g., age, gender, race, comorbidities, initial speech/language impairment, etc.)
1.______________________________________________________________________________________
2.______________________________________________________________________________________
3.______________________________________________________________________________________
4.______________________________________________________________________________________
SECTION V: Results
34. What were the outcome(s) observed for each group?
[Abstractor: Summarize results for each outcome, including statistic(s) used, and indicate source of data (page and paragraph number) for each group. Measures of central tendency or dispersion (i.e., standard deviation) and significance levels should be included, as appropriate.]
Outcome #1:________________________________
Page:_____________Paragraph/Table:___________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
Outcome #2:________________________________
Page:_______________Paragraph/Table:____________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
Outcome #3:__________________________
Page:_____________Paragraph/Table:_____________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
SECTION VI: Abstractor Notes and Comments
35. Limitations Noted in Article (provide page and paragraph numbers):
Page:___________Paragraph(s):_________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
36. Limitations Noted by Reviewer (provide page and paragraph numbers):
Page:_____________Paragraph(s):__________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
______________________________________________________________________________________
37. Abstractor Notes to Second Reviewer:
(Please highlight any areas or data in the article, listing page and paragraph(s), that do not fit into the abstraction form but that you think are important for the second reviewer to read and evaluate.)
Page:____________Paragraph(s):________
Reason:________________________________________________________________________________
______________________________________________________________________________________
Page:___________Paragraph(s):_______
Reason:________________________________________________________________________________
______________________________________________________________________________________
Page:_________Paragraph(s):_____________
Reason:________________________________
______________________________________________________________________________________

Figure D3. Data Extraction Form for Instrument Manuals Criteria for Determining Disability in Speech Language Disorders (pages 315-334)

Appendix D: Methodology
DATA EXTRACTION FORM CRITERIA FOR DETERMINING DISABILITY IN SPEECH LANGUAGE DISORDERS
SECTION 1: Study Background Information
Date of Review (MM/DD/2001): ____ / ____ / 2001
Tool Used: (specify instrument(s)/version(s) tested first, criterion/reference test next)
        Instrument #1:____________________________ Version: ___
        Instrument #2:____________________________ Version: ___
        Instrument #3: ____________________________ Version: ___
        Instrument #4: ____________________________ Version: ___
Population (highlight one):  1=Children <=21 yrs  2=Adults <=62 yrs
Disorder (highlight one):  1=Not reported  2=Unclear  3=Language
4=Speech  5=Voice
6=Other (specify)_______________________________________________
Year Published (YYYY): __ __ __ __
Country where study conducted: __ __
1=United States  2=Canada  3=Britain/United Kingdom
4=Australia  5=New Zealand  6=European Country_______________
7=South America__________________  8=Asia__________________________
Number of Authors: __ __ (Put 1 if only testing service given)
Surname of First Author____________________________________
Background of First Author (highlight all that apply):
1=Not reported  8=Pediatrics
2=Unclear  9=Neurology
3=Speech Language Pathology  10=Otolaryngology
4=Audiology  11=Education
5=Psychology  12=Test agency/publisher
6=Neuropsychology  13=Other (specify)_______________
7=Psychiatry
Funding Source (highlight all that apply):
1=Not reported  5=Industry
2=Unclear  6=Government Agency
3=Consumer/Patient Organization  7=Professional Organization
4=Charity  8=Other (specify)__________________
SECTION II: Overall Study Description
Main Objectives (as described by author, give page and paragraph number in article):
Page:________Paragraph(s):_________
________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Study/Evaluation Setting (highlight all that apply):
1=Not reported  7=Speech/Language Clinic: Inpatient
2=Unclear  8=Speech/Language Clinic: Outpatient
3=Community  9=Hospital: Inpatient
4=School: Preschool  10=Hospital: Outpatient
5=School: Elementary School  11=Other (specify)__________________________
6=School: Middle/High School
Characteristics of Assessors/Evaluators
Number of Assessors/Evaluators (if given): __ __
Background of Assessors/Evaluators (highlight all that apply):
1=Not reported  7=Psychiatrist
2=Unclear  8=Pediatrician
3=Speech Language Pathologist  9=Neurologist
4=Audiologist  10=Otolaryngologist
5=Psychologist  11=Graduate Student: (specify discipline)
6=Neuropsychologist  →________________________
Characteristics of Patients/Subjects
Abstractor: Add or eliminate groups as required.
Group 1
Defining Characteristic(s)___________________________________________
Co-morbidities:____________________________________________________
Total Number of Patients/Subjects:
Group 1:  #Initially entered study_________Dropouts_______#_______________
Group 2:  #Initially entered study_________Dropouts_______#_______________
Sex (#):
Group 1:  Males_______Females_______
Group 2:  Males_______Females_______
Race/Ethnic Group:
Race/Ethnicity    Group 1 (#  %)    Group 2 (#  %)
Not Reported    
Unclear    
White    
Black or African American    
Hispanic or Latino    
Asian    
American Indian or Alaskan Native    
Native Hawaiian or Pacific Islander    
Other (specify)    
  Age (Y-M):  Mean________  Median_______  SD___________  Range_____________
Age Group for Children: (Abstractor fill in as needed/appropriate)
Age Group (Y-M to Y-M)    Group 1 (#  %)    Group 2 (#  %)
     
     
     
     
     
     
     
     
     
School Grade for Children: (Abstractor fill in as needed/appropriate)
Grades    Group 1 (#  %)    Group 2 (#  %)
Pre-School    
     
     
     
     
     
     
     
     
Inclusion Criteria Reported (Highlight one):
1=Yes   2=No   3=Unclear
Page:______Paragraph(s):________
1.________________________________________________________________________________________
2.________________________________________________________________________________________
3.________________________________________________________________________________________
4.________________________________________________________________________________________
Exclusion Criteria Reported (Highlight one):
1=Yes   2=No   3=Unclear
Page:________Paragraph(s):___________
1.________________________________________________________________________________________
2.________________________________________________________________________________________
3.________________________________________________________________________________________
4.________________________________________________________________________________________
Study Purpose (highlight all that apply):
Reliability
  1. Internal Consistency

  2. Test-Retest

  3. Inter-rater

  4. Intra-rater

  5. Other

Validity
  1. Content

  2. Construct

  3. Concurrent

  4. Divergent

  5. Predictive

  6. Other

Normative Data
SECTION III: Reliability Evaluations
A. Internal Consistency Reliability Evaluation Description
Description of Study Design (summarize and give page and paragraph number in article)
(Skip only if same as overall design description)
Page:_______ Paragraph(s):__________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Dates of Study______________________________________________________________________________ (Provide month/year of starting and ending points, if given.)
Length of Study Follow-up____________________________________________________________________ (If follow-up times vary, include the range and mean, median, and/or mode if they are given.)
Study/Evaluation Setting (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Assessors/Evaluators (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Patients/Subjects (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Statistical Methods Employed--Internal Consistency
What statistics are employed? (Highlight all that apply)
Page:______Paragraph(s):_______
1=Not reported  4=Correlation Coefficient
2=Unclear  5=Item-total Correlation
3=Cronbach's Coefficient α  6=Other (specify)________________
Comments about how statistical methods are applied
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Results -- Internal Consistency Reliability
Abstractor: Summarize results, including statistic used, and indicate source of data (page and table number). Repeat section as necessary for subtests and overall score.
e.g., (Page 5, Table 2). Cronbach's alpha for sounds-in-words tests reported for each age group. Alphas range from 0.XX to 0.XX with a mean of 0.XX. Alphas reported separately for males (0.XX) and females (0.XX).....
Page:_________Table(s):_____________
Range:________Mean:__________Median:________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
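For abstractors who want to sanity-check a reported coefficient, the following is a minimal sketch of the conventional computation of Cronbach's coefficient α from item-level scores. It assumes complete data (every respondent has a score on every item); the function name and the sample scores are hypothetical and do not come from any reviewed instrument manual.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Assumes complete data: every respondent has exactly k item scores.
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two respondents and two items")
    if any(len(row) != k for row in scores):
        raise ValueError("every respondent must have the same number of items")

    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 3 respondents, 3 items that rise and fall together.
example_scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(cronbach_alpha(example_scores))
```

When a manual reports α separately by age group or sex, the same calculation would be applied to each subgroup's item scores in turn.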
B. Test-Retest Reliability Evaluation
Description of Study Design (summarize and give page and paragraph number in article)
(Skip only if same as overall design description)
Page:_________ Paragraph(s):________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Dates of Study______________________________________________________________________________ (Provide month/year of starting and ending points, if given.) (Skip only if same as overall design description)
Study/Evaluation Setting (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________

Characteristics of Assessors/Evaluators (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Patients/Subjects (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Statistical Methods Employed--Test-Retest Reliability
Is test-retest reliability reported for overall score?
1=Not reported   2=Unclear   3=No   4=Yes
What statistics are employed? (Highlight all that apply)
1=Not reported  4=Percent of agreement
2=Unclear  5=Correlation between scores
3=Cohen's Kappa  6=Other (specify)_________________
Is test-retest reliability reported on a per-item basis, if appropriate?
1=Not reported   2=Unclear   3=No   4=Yes
What statistics are employed? (Highlight all that apply)
1=Not reported  4=Percent of agreement
2=Unclear  5=Correlation between scores
3=Cohen's Kappa  6=Other (specify)_________________
Comments about how statistical methods are applied
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Results -- Test-Retest Reliability
Abstractor: Summarize results, including statistic used, and indicate source of data (page and table number). Repeat section as necessary for subtests and overall score.
e.g., (Page 5, Table 2). Cohen's Kappa for sounds-in-words tests reported for each age group for each of the three consonant positions (initial/medial/final). Kappas for initial sounds range from 0.XX to 0.XX with a mean of 0.XX. Kappas for medial sounds range from 0.XX to 0.XX with a mean of 0.XX. Kappas for final sounds range from 0.XX to 0.XX with a mean of 0.XX...
Page:____________Table(s):__________
Range:___________Mean:_______Median:____________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
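As a reference for the agreement statistics listed above, the following is a minimal sketch of percent of agreement and Cohen's Kappa for two sets of categorical ratings of the same items (two test occasions, or two raters). It assumes paired, complete ratings; the function names and sample ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(first, second):
    """Proportion of items given the same category on both occasions."""
    if len(first) != len(second) or not first:
        raise ValueError("ratings must be paired and non-empty")
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohens_kappa(first, second):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(first)
    observed = percent_agreement(first, second)
    counts_first, counts_second = Counter(first), Counter(second)
    # Chance agreement: product of the marginal proportions, summed over categories.
    expected = sum(counts_first[c] * counts_second[c] for c in counts_first) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: 1 = sound produced correctly, 0 = produced in error.
test_ratings = [1, 1, 0, 0]
retest_ratings = [1, 0, 0, 0]
print(percent_agreement(test_ratings, retest_ratings))  # 0.75
print(cohens_kappa(test_ratings, retest_ratings))       # 0.5
```

The gap between the two printed values shows why the form records which statistic was used: percent of agreement does not correct for agreement expected by chance, so it is not directly comparable with Kappa.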
C. Inter-Rater Reliability Evaluation:
(Skip to next section only if same as overall design description)

Description of Study Design
(summarize and give page and paragraph number in article)
Page:________Paragraph(s):__________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Dates of Study______________________________________________________________________________ (Provide month/year of starting and ending points, if given.)
Length of Study Follow-up____________________________________________________________________ (If follow-up times vary, include the range and mean, median, and/or mode if they are given.)
Study/Evaluation Setting (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Assessors/Evaluators (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Patients/Subjects (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Statistical Methods Employed--Inter-rater Reliability
Is inter-rater reliability reported for overall score?
1=Not reported   2=Unclear   3=No   4=Yes
What statistics are employed? (Highlight all that apply)
1=Not reported  4=Percent of agreement
2=Unclear  5=Correlation between scores
3=Cohen's Kappa  6=Other (specify)________________

Is inter-rater reliability reported on a per-item basis, if appropriate?
1=Not reported   2=Unclear   3=No   4=Yes
What statistics are employed? (Highlight all that apply)
1=Not reported  4=Percent of agreement
2=Unclear  5=Correlation between scores
3=Cohen's Kappa  6=Other (specify)________________
Comments about how statistical methods are applied
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Results -- Inter-rater Reliability
Abstractor: Summarize results, including statistic used, and indicate source of data (page and table number). Repeat section as necessary for subtests and overall score.
Page:___________Table(s):___________
Range:__________Mean:_____________Median:_________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
D. Intra-Rater Reliability Evaluation:
(Skip to next section only if same as overall design description)

Description of Study Design
(summarize and give page and paragraph number in article)
Page:_________ Paragraph(s):_______
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Dates of Study______________________________________________________________________________ (Provide month/year of starting and ending points, if given.)
Length of Study Follow-up____________________________________________________________________ (If follow-up times vary, include the range and mean, median, and/or mode if they are given.)
Study/Evaluation Setting (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Assessors/Evaluators (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Patients/Subjects (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Statistical Methods Employed -- Intra-rater Reliability
Is intra-rater reliability reported for overall score?
1=Not reported   2=Unclear   3=No   4=Yes
What statistics are employed? (Highlight all that apply)
1=Not reported  4=Percent of agreement
2=Unclear  5=Other (specify)________________
3=Cohen's Kappa
Is intra-rater reliability reported on a per-item basis, if appropriate?
1=Not reported   2=Unclear   3=No   4=Yes
What statistics are employed? (Highlight all that apply)
1=Not reported  4=Percent of agreement
2=Unclear  5=Other (specify)________________
3=Cohen's Kappa
Comments about how statistical methods are applied
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Results -- Intra-rater Reliability
Abstractor: Summarize results, including statistic used, and indicate source of data (page and table number). Repeat section as necessary for subtests and overall score.
Page:____________Table(s):____________
Range:___________Mean:_________Median:___________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
SECTION IV: Validity Evaluations

A. Content Validity Evaluation Description
Describe the construct of interest, including evidence that construct is well-defined.
Page_________    Paragraph____________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Is the construct of interest well-defined? (Highlight one)
1=Yes   2=No   3=Unclear
Do the authors provide evidence that they have covered the construct? (Highlight one)
1=Yes   2=No   3=Unclear
Describe any anomalies observed:
Page:______    Paragraph(s):_________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
B. Construct Validity Evaluation Description
Study Design/Dates of Study/Length of Study Follow-up (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Study/Evaluation Setting (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Assessors/Evaluators (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Patients/Subjects (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Statistical Methods Employed -- Construct Validity
What statistics are employed? (Highlight all that apply)
1=Not reported  4=Correlation Coefficient
2=Unclear  5=Discriminant Analysis
3=Cronbach's Coefficient α  6=Other (specify)______________
Are individual item and subtest/composite score intercorrelations presented?
1=Not reported   2=Unclear   3=No   4=Yes
Description of intercorrelations between individual item and subtest/composite score (please provide page and paragraph number)
Page:________    Paragraph(s):_________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Are composite score intercorrelations presented?
1=Not reported   2=Unclear   3=No   4=Yes
Description of intercorrelations between composite scores (please provide page and paragraph number)
Page:_______    Paragraph(s):_________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Are intercorrelations between composite and overall scores presented?
1=Not reported   2=Unclear   3=No   4=Yes
Description of intercorrelations between composite and overall scores (please provide page and paragraph number)
Page:________    Paragraph(s):_________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Was discriminant analysis performed?
1=Not reported   2=Unclear   3=No   4=Yes
Description of Discriminant Analysis (please provide page and paragraph number)
Page:_______    Paragraph(s):_________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Comments about how statistical methods are applied
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Results -- Construct Validity
Abstractor: Summarize results, including statistic used, and indicate source of data (page and table number). Repeat section as necessary for subtests and overall score.
Page:_________Table(s):__________
Range:________Mean:________Median:__________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
C. Concurrent Validity Evaluation Description
Study Design/Dates of Study/Length of Study Follow-up (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Study/Evaluation Setting (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Assessors/Evaluators (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Patients/Subjects (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Statistical Methods Employed -- Concurrent Validity
What statistics are employed? (Highlight all that apply)
1=Not reported  4=Standard Deviations
2=Unclear  5=Correlation Coefficients
3=Means  6=Other (specify)___________
Are subtest and composite score means reported for both the test and the criterion tool?
1=Not reported   2=Unclear   3=No   4=Yes
Are subtest and composite score standard deviations reported for the test and the criterion tool?
1=Not reported   2=Unclear   3=No   4=Yes
Are intercorrelations between the test and criterion tool presented?
1=Not reported   2=Unclear   3=No   4=Yes
Comments about how statistical methods are applied
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Results -- Concurrent Validity
Abstractor: Summarize results, including criterion tool and statistic used, and indicate source of data (page and table number). Repeat section as necessary for subtests and overall score.
Page:__________Table(s):_________
Range:_________Mean:____________Median:_______
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
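Concurrent validity is commonly summarized as a correlation between scores on the test and scores on the criterion tool. The following is a minimal sketch of the Pearson correlation coefficient, assuming paired scores from the same subjects; the function name and the sample scores are hypothetical, and a manual may report a different coefficient (for example, a rank correlation).

```python
from math import sqrt


def pearson_r(test_scores, criterion_scores):
    """Pearson correlation between paired test and criterion scores."""
    n = len(test_scores)
    if n != len(criterion_scores) or n < 2:
        raise ValueError("need at least two paired scores")

    mean_test = sum(test_scores) / n
    mean_crit = sum(criterion_scores) / n

    # Sums of squares and cross-products about the means.
    sxy = sum((x - mean_test) * (y - mean_crit)
              for x, y in zip(test_scores, criterion_scores))
    sxx = sum((x - mean_test) ** 2 for x in test_scores)
    syy = sum((y - mean_crit) ** 2 for y in criterion_scores)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one set of scores is constant")

    return sxy / sqrt(sxx * syy)


# Hypothetical standard scores for five subjects on the test and the criterion tool.
test_tool = [85, 90, 100, 110, 115]
criterion_tool = [80, 92, 98, 112, 118]
print(round(pearson_r(test_tool, criterion_tool), 3))
```

Because a correlation ignores differences in level, the form also asks whether means and standard deviations are reported for both tools.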
D. Divergent Validity Evaluation Description
Study Design/Dates of Study/Length of Study Follow-up (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Study/Evaluation Setting (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Assessors/Evaluators (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Patients/Subjects (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Statistical Methods Employed -- Divergent Validity
What statistics are employed? (Highlight all that apply)
1=Not reported   4=Standard Deviations
2=Unclear   5=Correlation Coefficients
3=Means   6=Other (specify)__________
Comments about how statistical methods are applied
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Results -- Divergent Validity
Abstractor: Summarize results, including statistic used, and indicate source of data (page and paragraph number).
Page:__________    Paragraph:__________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
E. Predictive Validity Evaluation Description
Study Design/Dates of Study/Length of Study Follow-up (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Study/Evaluation Setting (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Assessors/Evaluators (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Patients/Subjects (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Is outcome measured > 12 months after initial measurement?
1=Yes 
2=No→See Study Director immediately
3=Not reported→See Study Director immediately
4=Not clear→See Study Director immediately

Tools Used as Outcome Measures (i.e., what types of functioning/performance):
Tool #1:________________________
Tool #2:________________________
Statistical Methods Employed -- Predictive Validity
What statistics are employed? (Highlight all that apply)
1=Not reported   5=Regression modeling
2=Unclear   6=Factor analysis
3=t-tests, Z scores, ANOVA   7=Other (specify)___________
4=Correlations
Results -- Predictive Validity
Abstractor: Summarize results, including statistic used, and indicate source of data (page and paragraph number).
Page:_______    Paragraph/Table:____________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
SECTION V: Normative Data Evaluations

Study Design/Dates of Study/Length of Study Follow-up (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Study/Evaluation Setting (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Assessors/Evaluators (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Characteristics of Subjects Sampled (Note only differences from overall study):
__________________________________________________________________________________________
__________________________________________________________________________________________
Description of Study Design (as described by author, give page and paragraph number in article)
What population is the sample standardized to and how is this defined? (e.g., US population between age 2-0 and 21-0 years as defined by the 1998 U.S. Census)
Page:________    Paragraph(s):___________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Is the sample balanced (i.e., is representative or reflects the population)?
1= Not Clear   2=Yes   3=No
List variables used to balance sample:
Page:________    Paragraph(s):__________
__________________________________________________________________________________________
__________________________________________________________________________________________
Statistical Methods Employed -- Normative Data
What statistics are employed? (Highlight all that apply; this will depend upon the test.)
1=Not reported   6=Standard Error of Mean (SEM)
2=Unclear   7=Percentile Ranking
3=Mean   8=Test-Age Equivalent
4=Median   9=Other (specify)__________
5=Standard Deviation
Description of Standard Scales:
(Please describe briefly how the standard scales are derived.) 
Derivation of Percentile Ranking (if applicable):
Page:_________    Paragraph(s):_________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Derivation of Test-Age Equivalent (if applicable):
Page:________    Paragraph(s):_________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Results -- Normative Data
Normative Data Reported for (highlight all that apply):
1=Not Reported
2=Unclear   7=Educational Attainment: Subject
3=Age   8=Disorder (specify)___________
4=Gender   9=Region
5=Race   10=Urban/Rural
6=Educational Attainment: Parental   11=Other (specify)__________
SECTION VI: Abstractor Notes and Comments

Limitations Noted in Article
(provide page and paragraph numbers):
Page:_________    Paragraph(s):___________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
_________________________________________________________________
Limitations Noted by Reviewer (provide page and paragraph numbers):
Page:_______    Paragraph(s):____________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
__________________________________________________________________________________________
Abstractor Notes to Second Reviewer: (Please highlight any areas or data in the article, listing page and paragraph(s), which do not fit into the abstraction form but you think are important for the second reviewer to read and evaluate.)
Page:______   Paragraph(s):________
Reason:___________________________________________________________________________________
_________________________________________________________________________________________
Page:________   Paragraph(s):_______
Reason:___________________________________________________________________________________
_________________________________________________________________________________________
Page:________   Paragraph(s):_____________
Reason:___________________________________________________________________________________
__________________________________________________________________________________________

Figure D4. Article Quality Rating Forms (pages 337-340)

Quality Grading Scale for Individual Studies

Quality Rating Form Development

The attached table shows a provisional draft of the quality rating scale that we will use to grade individual studies. Although the majority of these studies will provide information on the predictive validity of the instruments selected by the Technical Expert Advisory Group, some will present original data describing the validation of and perhaps normative data for the instrument.

In developing this scale, we were guided by the elements described in the Consolidated Standards of Reporting Trials (CONSORT; Begg et al., 1996; Meinert, 1998; Moher, 1998) and the work of Drs. Kathleen Lohr and Timothy Carey (1999) on the RTI-UNC EPC team. Taking these criteria as a starting point, we made two modifications to better assess the types of diagnostic and evaluation instruments used for speech and language disorders. First, because this report seeks to examine the reliability and validity (including predictive validity) of different instruments across specific target populations, we have expanded the questions regarding protocol description to include more detailed information on the process by which reliability and validity are ascertained. To do this we employed the criteria proposed by McCauley and Swisher (1984) and the American Educational Research Association (2000).

Each item on our grading checklist contributes 1 point to the total quality grade. A total of 13 points can be scored for research design and study conduct, 19 for the measurement of reliability and validity and the development of test norms, and 3 points for the justification of conclusions and external validity considerations.

Use of the Quality Rating Form

The Project Director, Scientific Director, and Dr. Celia Hooper, the RTI-UNC EPC adult speech and language expert, will complete the quality rating forms. The Project Director will evaluate all abstracted articles from a methodological perspective. The Scientific Director and Dr. Hooper will evaluate articles in their areas of expertise. For areas not within the expertise of either Dr. Hooper or the Scientific Director, expert colleagues from the University of North Carolina at Chapel Hill Division of Speech and Hearing Sciences will conduct the quality rating. To ensure fidelity in rating, the Scientific Director and Dr. Hooper will re-grade a 10% random sample of articles graded by their Division colleagues.
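The 10% re-grading sample described above could be drawn as in the following minimal sketch. The function name `regrade_sample`, the article identifiers, and the choice to round the sample size up to a whole article are illustrative assumptions; the report does not specify how the sample was drawn.

```python
import random


def regrade_sample(article_ids, percent=10, seed=None):
    """Draw a simple random sample of articles for re-grading.

    `percent` is the share of articles to re-grade. The sample size is
    rounded up (an assumption, not stated in the report) so that at
    least one article is drawn from any non-empty list.
    """
    ids = list(article_ids)
    if not ids:
        return []
    # Ceiling division with integers avoids floating-point rounding.
    k = max(1, -(-len(ids) * percent // 100))
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return sorted(rng.sample(ids, k))


# Hypothetical identifiers for 50 articles graded by Division colleagues.
articles = [f"A{i:03d}" for i in range(50)]
print(regrade_sample(articles, percent=10, seed=7))
```

Recording the seed alongside the sample would let a second reviewer reproduce the same draw.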

Two graduate students in the Division of Speech and Hearing Sciences will complete Section V (Usability of the Tool). The Scientific Director and/or Dr. Hooper will oversee their work. To assure consistency between the graduate student reviewers, we will compute the level of inter-rater reliability using Cohen's kappa statistic.
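As an illustration of the agreement statistic named above, the sketch below computes the unweighted Cohen's kappa for two raters who assign categorical ratings to the same items. The function name `cohens_kappa` and the example ratings are hypothetical and are not data from this report.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters rating the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical Yes/No ratings from two reviewers on eight items.
reviewer_1 = ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]
reviewer_2 = ["Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes"]
print(cohens_kappa(reviewer_1, reviewer_2))  # 0.75
```

In this example the reviewers agree on 7 of 8 items (0.875) and chance agreement is 0.5, so kappa is (0.875 - 0.5) / (1 - 0.5) = 0.75.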

Instructions for Completion and Scoring of Quality Rating Form

The quality reviewer will circle the appropriate number for each item or indicate N/A (not applicable) if the item is not appropriate to the article. The score for the article will be given as a percentage determined by dividing the total number of points circled on the rating form (numerator) by the total number of points for the rating form (denominator). In the event that an item on the rating form does not apply, the points for that item will be subtracted from both the numerator and the denominator when calculating the percentage score. The quality rating scores will be reported separately for the clinical and methodological experts rather than averaging the scores across the quality reviewers.
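The scoring rule above can be sketched as follows. The function name `quality_score` and the example ratings are illustrative assumptions. The sketch treats an N/A item as contributing no points to either the numerator or the denominator, which is one reading of the instruction to subtract its points from both.

```python
def quality_score(ratings):
    """Return the percentage score for one article rating form.

    `ratings` holds one entry per checklist item: 1 (Yes),
    0 (No/not reported), or None for an item marked N/A.
    """
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("every item is N/A; the score is undefined")
    if any(r not in (0, 1) for r in applicable):
        raise ValueError("each rated item must be 0 or 1")
    # Each applicable item is worth 1 point, so the denominator is the
    # number of applicable items.
    return 100.0 * sum(applicable) / len(applicable)


# Hypothetical form: 35 one-point items, 24 rated Yes, 8 rated No, 3 N/A.
ratings = [1] * 24 + [0] * 8 + [None] * 3
print(quality_score(ratings))  # 75.0
```

Without the N/A adjustment the same form would score 24/35, or about 68.6%, so the adjustment avoids penalizing an article for items that do not apply to it.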

Quality Rating Scale for Individual Studies
Issue   No/not reported   Yes
Category I. Research Design and Study Conduct (13 points)
  1. Is the purpose of the study stated?
  2. Can the research question(s) be addressed with the methods proposed?
Scoring (items 1-2): 0=No/not reported   1=Yes
Evaluation Instrument Description and Use:
  3. Is the evaluation instrument(s) (including version) specified?
  4. Is the population specified, including demographic characteristics (e.g., age, gender, race, SES, etc.), presence and/or severity of speech/language impairment, comorbid impairments, if any?
  5. Are the persons administering the instrument representative (i.e., have similar experience, certification) of individuals who will administer the instrument in "everyday" practice?
Scoring (items 3-5): 0=No/not reported   1=Yes
Study Design Considerations:
  6. Is the study design used appropriate for validating the instrument?
  7. Are eligibility criteria or recruitment criteria for the study specified?
  8. Is the sampling strategy (i.e., how subjects were selected from population) specified?
  9. Is loss to follow-up reported?
Scoring (items 6-9): 0=No/not reported   1=Yes
Internal Validity:
  10. Is a comparison group present?
  11. If a comparison group is present, is attrition differential between the groups?
Scoring (items 10-11): 0=No/not reported   1=Yes   NA=not applicable
Statistical Analysis:
  12. Are multiple comparisons taken into account if multiple univariate tests were performed?
  13. Are statistical tests used appropriate to the data?
Scoring (items 12-13): 0=No/not reported   1=Yes   NA=not applicable (offered for one of the two items)
Category II. Outcomes: Measurement of Reliability/Validity/Normative Data (19 points)
  14. Is construct validity measured?
  15. If construct validity is measured, are/is:
      1. Procedures for selecting experts and eliciting judgements reported
      2. Empirical evidence supporting the relationships between the domains measured by the instrument and cognitive processes, if any, specified
      3. Evidence of and rationale for interpretation of subsets or subscores reported
      4. Evidence of interrelationships, if any, between parts of instrument reported
  16. Is concurrent/criterion validity measured?
  17. If concurrent/criterion validity is measured, is the instrument validated against a "gold standard" or criterion instrument(s)?
  18. Is predictive validity measured?
  19. If predictive validity is measured, are/is:
      1. Hypotheses to be tested reported a priori
      2. Statistical summaries (e.g., means, standard deviations, etc.) describing the association between the instrument score and the outcome measure reported
Scoring (items 14-19, with each sub-item of 15 and 19 scored separately): 0=No/not reported   1=Yes   NA=not applicable
Reliability
  20. Is internal consistency reliability measured?
  21. If internal consistency reliability is measured, are appropriate statistics (i.e., Cronbach's coefficient alpha, Kuder-Richardson statistic, Kuder-Richardson-20, KR-20) used?
  22. Is test-retest reliability measured?
  23. If test-retest reliability is measured, are appropriate statistics (i.e., Cohen's kappa for categorical scales, correlations for continuous numeric scores) used?
  24. Is inter-rater reliability measured?
  25. If inter-rater reliability is measured, are appropriate statistics (i.e., Cohen's kappa for categorical scales, correlations for continuous numeric scores) used?
  26. Is intra-rater reliability measured?
  27. If intra-rater reliability is measured, are appropriate statistics (i.e., Cohen's kappa for categorical scales, correlations for continuous numeric scores) used?
  28. If the instrument is used for different populations or age groups and if separate norms are presented, are reliability data provided separately for each group or population?
Scoring (items 20-28): 0=No/not reported   1=Yes   NA=not applicable
Category III. Justification for Conclusions and External Validity (3 points)
  29. Are conclusions warranted from the data?
  30. Do the study conclusions apply to U.S. populations?
  31. Are the limitations of the study reported?
Scoring (items 29-31): 0=No/not reported   1=Yes

References

American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
Begg C, Cho M, Eastwood S, et al. Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA. 1996; 276: 637-639.
Lohr KN, Carey TS. Assessing "best evidence": issues in grading the quality of studies for systematic reviews. Jt Comm J Qual Improv. 1999; 25(9): 470-479.
McCauley RJ, Swisher L. Psychometric review of language and articulation tests for preschool children. J Speech Hearing Disorders. 1984; 49: 34-42.
Meinert CL. Beyond CONSORT: Need for improved reporting standards for clinical trials. JAMA. 1998; 279: 1487-1489.
Moher D. CONSORT: An evolving tool to help improve the quality of reports of randomized controlled trials. JAMA. 1998; 279: 1489-1491.

Figure D5. Test Manual Quality Rating Forms (pages 343-348)

Provisional Quality Grading Scale for Manuals

Quality Rating Form Development

The attached table presents the draft quality grading scale developed for use with the instrument manuals. The scale previously used by the RTI-UNC Evidence-based Practice Center (EPC) is tailored to randomized and non-randomized clinical trials and is derived from the Consolidated Standards of Reporting Trials (CONSORT; Begg et al., 1996; Meinert, 1998; Moher, 1998) and the work of Drs. Kathleen Lohr and Timothy Carey (1999) on the RTI-UNC EPC team. We could not use the existing scale to evaluate the conduct of studies that measure the reliability and validity of measurement instruments, for several reasons.

First, the existing scale is designed for use with randomized and non-randomized clinical trials; psychometric evaluation of an instrument is rarely done with this type of study design and thus many of the important elements of this scale would not be addressed (and appropriately so) by the type of design used in reliability and validity studies. It also would have required a nearly complete revision to include the data elements necessary to appropriately evaluate reliability and validity. Moreover, there exists an established literature describing the criteria and standards for evaluating the development and psychometric testing of educational and psychological instruments. Since 1945, the American Educational Research Association (AERA), American Psychological Association, and the National Council on Measurement in Education have published five documents outlining standards for the development and use of educational and psychological tests. The most recent version, the 1999 Standards for Educational and Psychological Testing, which provides 122 standards for test construction, evaluation, and documentation, forms the basis for our manual quality-rating scheme. These standards specifically describe methodologically and ethically sound approaches for the development and testing of validity; reliability and errors of measurement; test development and revision; the development of scales, norms, and score comparability; test administration, scoring, and reporting; and documentation supporting the instrument.

Our use of the 1999 Standards does not set a precedent in the speech and language pathology literature. In 1984 McCauley and Swisher adapted the 1974 Standards to develop 10 criteria to evaluate the psychometric properties of language and articulation instruments for preschool children. These criteria, while appealing for their simplicity and small number, neither represent the methodological development of the past 15 years nor allow us to evaluate comprehensively the quality of the selected manuals. Consequently, we began with McCauley and Swisher's (1984) criteria and selected 46 additional criteria from the 1999 Standards. Fifty-six questions comprise the manual quality-rating scale. The scale totals 100 points, with 10 associated with the instrument development or revision process, 25 each for the measurement of reliability and validity, 25 for the development of instrument norms or standard scores, 10 for usability of the evaluation instrument, and 5 for justification of conclusions and external validity.

Use of the Quality Rating Form

The Project Director, Scientific Director, and Dr. Celia Hooper, the RTI-UNC EPC adult speech and language expert, will complete a form for each manual. The Scientific Director and Dr. Hooper will evaluate instruments in their area of expertise. The Project Director will evaluate from a methodological standpoint. For areas not within the expertise of either Dr. Hooper or the Scientific Director, expert colleagues from the University of North Carolina at Chapel Hill Division of Speech and Hearing Sciences will conduct the quality rating.

Instructions for Completion and Scoring of Quality Rating Form

The quality reviewer will circle the appropriate number for each item or indicate N/A (not applicable) if the item is not appropriate to the manual. The score for the manual will be given as a percentage determined by dividing the total number of points circled on the rating form (numerator) by 100 (denominator -- the total number of points for the rating form). In the event that an item on the rating form does not apply, the points for that item will be subtracted from both the numerator and the denominator when calculating the percentage score. The quality rating scores will be reported separately for the clinical and methodological experts rather than averaging the scores across the quality reviewers.
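A worked form of this calculation may help. The symbols C (points circled) and N (points belonging to items marked N/A) are introduced here for illustration only, and the figures in the example are hypothetical. The sketch assumes that an N/A item has no circled points, so only the denominator changes in practice.

```latex
\[
\text{score (\%)} = 100 \times \frac{C}{100 - N}
\]
% Hypothetical example: C = 69 points circled, N = 8 points marked N/A
\[
100 \times \frac{69}{100 - 8} = 100 \times \frac{69}{92} = 75\%
\]
```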

References

American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
Begg C, Cho M, Eastwood S, et al. Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA. 1996; 276: 637-639.
Lohr KN, Carey TS. Assessing "best evidence": issues in grading the quality of studies for systematic reviews. Jt Comm J Qual Improv. 1999; 25(9): 470-479.
McCauley RJ, Swisher L. Psychometric review of language and articulation tests for preschool children. J Speech Hearing Disorders. 1984; 49: 34-42.
Meinert CL. Beyond CONSORT: Need for improved reporting standards for clinical trials. JAMA. 1998; 279: 1487-1489.
Moher D. CONSORT: An evolving tool to help improve the quality of reports of randomized controlled trials. JAMA. 1998; 279: 1489-1491.

Figure D6. Usability Evaluation Form Criteria for Determining Speech and Language Disorders

Usability Evaluation Form

Date of Review (MM/DD/2001): __ __ / __ __ / 2001
Instrument: _________________________________________________________________________________________________ Version: ___
Instructions: Circle only one item in each row.

Criteria   Rating
Are the procedures for administering the instrument described in sufficient detail to enable users to duplicate administration procedures used during standardization?   Yes   No
Are the procedures for scoring the instrument described in sufficient detail to enable users to duplicate the scoring procedures used during standardization?   Yes   No
If the administrator or scorer of the instrument must have special qualifications, does the manual specify what those qualifications must be?   Yes   No
Does the manual document the training required by instrument administrators and/or scorers to use the instrument appropriately?   Yes   No
Does the manual supply information about the special environmental or equipment needs required to use and score the instrument?   Yes   No
Does the manual clearly explain the meaning and intended interpretations of raw score scales and the limitations of those scores?   Yes   No
Does the manual clearly explain the meaning and intended interpretation of derived score scales and the limitations of those scores?   Yes   No
When scales are to be used for reporting scores, does the manual clearly describe the construction of the scales?   Yes   No

Comments:

References
1. Disability Evaluation Under Social Security. SSA Publication No. 64-039. Baltimore, Md: Social Security Administration; 1998.
2. Strategic plan: Plain language version. (no date); accessed August 8, 2001. Web Page. Available at: http://www.nidcd.nih.gov/about/director/nsrp.htm.
3. ASHA Factsheets. Accessed June 1, 2000. Web Page. Available at: http://www.asha.org/marketing/bhs_factsheets.htm.
4. Ruben RJ. Redefining the survival of the fittest: communication disorders in the 21st century. Laryngoscope. 2000; 110: 241-245.
5. Gierut JA. Treatment efficacy: Functional phonological disorders in children. J Speech Lang Hear Res. 1998; 41: S85-S100.
6. Conture EG. Treatment efficacy: Stuttering. J Speech Hear Res. 1996; 39: S18-S26.
7. Ramig LO, Verdolini K. Treatment efficacy: Voice disorders. J Speech Lang Hear Res. 1998; 41: S101-S116.
8. Tallal P, Ross R, Curtiss S. Unexpected sex-ratios in families of language/learning-impaired children. Neuropsychologia. 1989; 27: 987-998.
9. Tomblin JB. Familial concentration of developmental language impairment. J Speech Hear Disord. 1989; 54: 287-295.
10. Felsenfeld S, Plomin R. Epidemiological and offspring analyses of developmental speech disorders using data from the Colorado Adoption Project. J Speech Lang Hear Res. 1997; 40: 778-791.
11. Tomblin JB, Records NL, Buckwalter P, Zhang X, Smith E, O'Brien M. Prevalence of specific language impairment in kindergarten children. J Speech Lang Hear Res. 1997; 40: 1245-1260.
12. Gibbs DP, Cooper EB. Prevalence of communication disorders in students with learning disabilities. J Learning Disabilities. 1989; 22: 60-63.
13. Cohen NJ, Vallance DD, Barwick M, et al. The interface between ADHD and language impairment: An examination of language, achievement and cognitive processing. J Child Psychol Psychiatry Allied Disciplines. 2000; 41: 353-362.
14. Giddan JJ, Milling L. Comorbidity of psychiatric and communication disorders in children. Child Adolesc Psychiatr Clin North Am. 1999; 8: 19-36.
15. Tirosh E, Cohen A. Language deficit with attention-deficit disorder: A prevalent comorbidity. J Child Neurol. 1998; 13: 493-497.
16. Cohen NJ, Davine M, Horodezky N, Lipsett L, et al. Unsuspected language impairment in psychiatrically disturbed children: Prevalence and language and behavioral characteristics. J Am Acad Child Adolesc Psychiatr. 1993; 32: 595-603.
17. Skenes LL, McCauley RJ. Psychometric review of nine aphasia tests. J Commun Disord. 1985; 18: 461-474.
18. Kilmon CA, Barber N, Chapman K. Instruments for the screening of speech/language development in children. J Pediatr Health Care. 1991; 5: 61-70.
19. Engen E, Engen T. Rhode Island Test of Language Structure Manual. Baltimore, Md: University Park Press; 1983.
20. Westby C. Multicultural issues in speech and language assessment. In: Tomblin JB, Morris HL, Spriestersbach DC, eds. Diagnosis in Speech-Language Pathology. 2nd ed. San Diego, Calif: Singular; 2000: 35-62.
21. Johnson W, Darley F, Spriestersbach D. Diagnostic Methods in Speech Pathology. New York, NY: Harper & Row; 1978.
22. ICIDH-2: International Classification of Functioning, Disability and Health. accessed October 1, 1910. Web Page. Available at: http://www.who.int/icidh/. 2001.
23. Yorkston K, Beukelman D. The Assessment of Intelligibility of Dysarthric Speech. Austin, Tex: Pro-Ed; 1984.
24. Frattali CM, Thompson CK, Holland AL, Wohl CB, Ferketic MM. The American Speech-Language-Hearing Association Functional Assessment of Communication Skills for Adults (ASHA FACS). Rockville, Md: ASHA; 1995.
25. Benninger MS, Ahuja AS, Gardner G, Grywalski C. Assessing outcomes for dysphonic patients. J Voice. 1998; 12: 540-550.
26. Jacobson BH, Johnson A, Grywalski C, et al. The Voice Handicap Index (VHI): development and validation. Am J Speech-Lang Pathol. 1997; 6: 66-70.
27. Harris LG, Shelton IS. Desk Reference of Assessment Instruments in Speech and Language (Revised). San Antonio, Tex: Communication Skill Builders; 1998.
28. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
29. McCauley RJ, Swisher L. Psychometric review of language and articulation tests for preschool children. J Speech Hear Disord. 1984; 49: 34-42.
30. Lohr KN, Aaronson NK, Burnam MA, Patrick DL, Perrin EB, Roberts JS. Evaluating quality-of-life and health status instruments: Development of Scientific Review Criteria. Clin Ther. 1996; 18: 979-991.
31. Anastasi A. Psychological Testing. 6th ed. New York, NY: MacMillan Publishing Co.; 1988.
32. Nunnally JC, Bernstein IH. Psychometric Theory. 3rd ed. New York, NY: McGraw-Hill, Inc.; 1994.
33. Cronbach LJ. Essentials of Psychological Testing. 5th ed. New York, NY: Harper Collins Publishers, Inc.; 1990.
34. Lohr KN, Carey TS. Assessing 'best evidence': issues in grading the quality of studies for systematic reviews. Joint Commission J Qual Improvement. 1999; 25: 470-479.
35. U.S. Preventive Services Task Force. Guide to Clinical Preventive Services. 2nd ed. Alexandria, Va: International Medical Publishing; 1996.
36. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960; 20: 37-46.
37. Carmines EG, Zeller RA. Reliability and Validity Assessment. Beverly Hills, Calif: Sage; 1979.
38. Domino G. Psychological Testing: An Introduction. Upper Saddle River, NJ: Prentice Hall; 2000.
39. Kuder GF, Richardson MW. The theory of estimation of test reliability. Psychometrika. 1937; 2.
40. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977; 33: 159-174.
41.
Halmstadter GG. New York, NY: Appleton-Century-Crofts, Inc.; 1964. Principles of Psychological Measurement.
42.
Cohen J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull. 1968; 70
43.
Goodglass H, Kaplan E. Media, Pa: Williams and Wilkins; 1983. The Assessment of Aphasia and Related Disorders.
44.
Crary MA, Wertz RT, Deal JL. Classifying aphasias: cluster analysis of Western Aphasia Battery and Boston Diagnostic Aphasia Examination results. Aphasiology. 1992; 6: 2936.
45.
Rosselli M, Ardila A, Florez A, Castro C. Normative data on the Boston Diagnostic Aphasia Examination in a Spanish-speaking population. J Clin Exp Neuropsychol. 1990; 12: 313322. [PubMed]
46.
Porch B. Chicago, Ill: Riverside Publishing Company; 1967. Porch Index of Communicative Ability.
47.
Clark C, Crockett D, Klonoff H. Factor analysis of the Porch Index of Communication Ability. Brain Lang. 1979; 7: 1-7.
48.
Lendrem W, Lincoln NB. Spontaneous recovery of language in patients with aphasia between 4 and 34 weeks after stroke. J Neurol Neurosurg Psychiatry. 1985; 48: 743-748. [PubMed]
49.
Lincoln NB, McGuirk E. Prediction of language recovery in aphasic stroke patients using the Porch Index of Communicative Ability. Br J Disord Commun. 1986; 21: 83-88. [PubMed]
50.
Kertesz A. 2nd ed. New York, NY: Grune & Stratton; 1982. Western Aphasia Battery Test Manual.
51.
Shewan CM, Kertesz A. Reliability and validity characteristics of the Western Aphasia Battery (WAB). J Speech Hear Disord. 1980; 45: 308-324. [PubMed]
52.
Kertesz A. Orlando, Fla: Grune & Stratton; 1972. The Western Aphasia Battery.
53.
Shewan CM. The language quotient (LQ): A new measure for the Western Aphasia Battery. J Commun Disord. 1986; 19: 427-439. [PubMed]
54.
Kertesz A, McCabe P. Recovery patterns and prognosis in aphasia. Brain. 1977; 100 Pt 1: 1-18. [PubMed]
55.
Crary MA, Gonzalez Rothi LJ. Predicting the Western Aphasia Battery Aphasia Quotient. J Speech Hear Disord. 1989; 54: 163-166. [PubMed]
56.
Lincoln NB, Blackburn M, Ellis S, et al. An investigation of factors affecting progress of patients on a stroke unit. J Neurol Neurosurg Psychiatry. 1989; 52: 493-496. [PubMed]
57.
Semel E, Wiig EH, Secord WA. 3rd ed. San Antonio, Tex: The Psychological Corporation; 1995. Clinical Evaluation of Language Fundamentals.
58.
Perez E, Slate JR, Neeley R, McDaniel M, et al. Validity of the CELF-R, TONI, and SIT for children referred for auditory processing problems. J Clin Psychol. 1995; 51: 540-543. [PubMed]
59.
Kotsopoulos S, Walker S, Beggs K, Jones B, Kotsopoulos A, Patel P. Reading and spelling deficits among children attending a psychiatric day treatment program. Eur Child Adolesc Psychiatry. 1996; 5: 83-92. [PubMed]
60.
Lewis BA, O'Donnell B, Freebairn LA, Taylor HG. Spoken language and written expression--interplay of delays. Am J Speech-Lang Pathol. 1998; 7: 77-84.
61.
Wiig EH, Secord WA, Semel E. San Antonio, Tex: The Psychological Corporation; 1992. Clinical Evaluation of Language Fundamentals- Preschool.
62.
Semel E, Wiig EH, Secord WA. San Antonio, Tex: The Psychological Corporation; 1987. Clinical Evaluation of Language Fundamentals. Revised.
63.
Wechsler D. San Antonio, Tex: The Psychological Corporation; 1989. Wechsler Preschool and Primary Scale of Intelligence. Revised.
64.
Brown CL, Sherbenou RJ, Johnson SK. Austin, Tex: Pro-Ed; 1982. Test of Nonverbal Intelligence (TONI).
65.
Slosson RL. East Aurora, NY: Slosson Educational Publications; 1984. Slosson Intelligence Test.
66.
Kaufman AS, Kaufman N. Circle Pines, Minn: American Guidance Service; 1985. Kaufman Test of Educational Achievement.
67.
Hammill DD, Larsen SC. Austin, Tex: Pro-Ed; 1978. Test of Written Language (TOWL).
68.
Semel E, Wiig EH, Secord WA. Spanish Version. 3rd ed. San Antonio, Tex: The Psychological Corporation; 1997. Clinical Evaluation of Language Fundamentals.
69.
Newcomer PL, Hammill DD. 3rd ed. Austin, Tex: Pro-Ed; 1997. Test of Language Development-Primary.
70.
Fodness RW, McNeilly J, Bradley-Johnson S. Test-retest reliability of the Test of Language Development-2: Primary and Test of Language Development-2: Intermediate. J School Psychol. 1991; 29: 161-166.
71.
Hammill D, Newcomer P. 3rd ed. Austin, Tex: Pro-Ed; 1997. Test of Language Development-Intermediate.
72.
Lewis BA, Freebairn LA, Taylor HG. Academic outcomes in children with histories of speech sound disorders. J Commun Disord. 2000; 33: 11-30, 89-92. [PubMed]
73.
Hammill DD, Brown VL, Larsen SC, Wiederholt JL. 3rd ed. Austin, Tex: Pro-Ed; 1994. Test of Adolescent and Adult Language.
74.
Zimmerman I, Steiner V, Pond R. 3rd ed. San Antonio, Tex: The Psychological Corporation; 1992. Preschool Language Scale.
75.
McLoughlin CS, Gullo DF. Comparison of three formal methods of preschool language assessment. Lang Speech Hear Serv Schools. 1984; 15: 146-153.
76.
Berryman JD. Use of the revised Preschool Language Scale with older preschool children. Lang Speech Hear Serv Schools. 1983; 14: 79-85.
77.
Pecyna Rhyner PM, Bracken BA. Concurrent validity of the Bracken Basic Concept Scale with language and intelligence measures. J Commun Disord. 1988; 21: 479-489. [PubMed]
78.
Frankenburg WK, Dodds J, Archer P, et al. Denver II. Denver, Colo: Denver Developmental Materials; 1990.
79.
Zimmerman IL, Steiner VG, Pond RE. Rev ed. San Antonio, Tex: The Psychological Corporation; 1979. Preschool Language Scale.
80.
Hresko WP, Reid DK, Hammill DD. Austin, Tex: Pro-Ed; 1981. The Test of Early Language Development.
81.
Dunn LM, Dunn LM. Rev ed. Minneapolis, Minn: American Guidance Service; 1981. The Peabody Picture Vocabulary Test.
82.
Bracken BA. San Antonio, Tex: The Psychological Corporation; 1984. Bracken Basic Concept Scale: Diagnostic Scale.
83.
Long EE. Native American children's performance on the Preschool Language Scale-3. J Child Commun Dev. 1998; 19: 43-47.
84.
Zimmerman IL, Steiner VG, Pond RE. Spanish Version, 3rd ed. San Antonio, Tex: The Psychological Corporation; 1993. Preschool Language Scale.
85.
Phelps-Terasaki D, Phelps-Gunn T. Austin, Tex: Pro-Ed; 1992. Test of Pragmatic Language.
86.
Hresko WP, Reid DK, Hammill DD, Ginsburg HP, Baroody AJ. Austin, Tex: Pro-Ed; 1988. Screening Children for Related Early Educational Needs.
87.
Bryant B, Newcomer PL. Austin, Tex: Pro-Ed; 1991. Scholastic Aptitude Scale.
88.
Drummond S. San Antonio, Tex: Communication Skills Builders; 1993. Dysarthria Examination Battery.
89.
Riley GD. 3rd ed. Austin, Tex: Pro-Ed; 1994. Stuttering Severity Instrument for Children and Adults.
90.
Yaruss JS, Conture EG. Relationship between mother-child speaking rates in adjacent fluent utterances. ASHA. 1992; 34(10):.
91.
Riley G. Austin, Tex: Pro-Ed; 1981. Stuttering Prediction Instrument for Young Children.
92.
Goldman R, Fristoe M. Circle Pines, Minn: American Guidance Service; 2000. Goldman-Fristoe Test of Articulation.
93.
Seymour HN, Seymour CM. Black English and standard American English contrasts in consonantal development of four and five-year old children. J Speech Hear Disord. 1981; 46: 274-280. [PubMed]
94.
Goldman R, Fristoe M. Circle Pines, Minn: American Guidance Service, Inc.; 1969. Goldman-Fristoe Test of Articulation.
95.
Botting N, Conti-Ramsden G, Crutchley A. Concordance between teacher/therapist opinion and formal language assessment scores in children with language impairment. Eur J Disord Commun. 1997; 32: 317-327. [PubMed]
96.
Mullen PA, Whitehead RL. Stimulus picture identification in articulation testing. J Speech Hear Disord. 1977; 42: 113-118. [PubMed]
97.
Fudala JB. Los Angeles, Calif: Western Psychological Services; 1970. Arizona Articulation Proficiency Scale.
98.
Hirano M. New York, NY: Springer-Verlag; 1981:100. Clinical Examination of Voice.
99.
De Bodt MS, Wuyts FL, Van de Heyning PH, Croux C. Test-retest study of the GRBAS scale: influence of experience and professional background on perceptual rating of voice quality. J Voice. 1997; 11: 74-80. [PubMed]
100.
Dejonckere PH, Obbens C, de Moor GM, Wieneke GH. Perceptual evaluation of dysphonia: reliability and relevance. Folia Phoniatrica. 1993; 45: 76-83. [PubMed]
101.
de Krom G. Consistency and reliability of voice quality ratings for different types of speech fragments. J Speech Hear Res. 1994; 37: 985-1000. [PubMed]
102.
Dejonckere PH, Remacle M, Fresnel-Elbaz E, Woisard V, Crevier-Buchman L, Millet B. Differentiated perceptual evaluation of pathological voice quality: reliability and correlations with acoustic measurements. Revue De Laryngologie Otologie Rhinologie. 1996; 117: 219-224. [PubMed]
103.
Millet B, Dejonckere PH. What determines the differences in perceptual rating of dysphonia between experienced raters? Folia Phoniatrica Et Logopedica. 1998; 50: 305-310.
104.
Wuyts FL, De Bodt MS, Van de Heyning PH. Is the reliability of a visual analog scale higher than an ordinal scale? An experiment with the GRBAS scale for the perceptual evaluation of dysphonia. J Voice. 1999; 13: 508-517. [PubMed]
105.
Langeveld TP, Drost HA, Frijns JH, Zwinderman AH, Baatenburg de Jong RJ. Perceptual characteristics of adductor spasmodic dysphonia. Ann Otol Rhinol Laryngol. 2000; 109: 741-748. [PubMed]
106.
Kay Elemetrics Corporation. Lincoln Park, NJ: Kay Elemetrics; 1999. Multi-Dimensional Voice Program (MDVP) Model 5105: Software Instruction Manual.
107.
Deliyski DD, Gress CD. Inter-system reliability of MDVP for Windows 95/98 and DOS. Kay Elemetrics Corporation. Lincoln Park, NJ: Kay Elemetrics Corporation. Multi-Dimensional Voice Program (MDVP) Model 5105: Software Instruction Manual. 1999: 67-73.
108.
Kent RD, Vorperian HK, Duffy JR. Reliability of the Multi-Dimensional Voice Program for the analysis of voice samples of subjects with dysarthria. Am J Speech-Lang Pathol. 1999; 8: 129-136.
109.
van As CJ, Hilgers FJM, Verdonck-de Leeuw IM, Koopmans-van Beinum FJ. Acoustical analysis and perceptual evaluation of tracheoesophageal prosthetic voice. J Voice. 1998; 12: 239-248. [PubMed]
110.
Deliyski DD. Acoustic model and evaluation of pathological voice production. Kay Elemetrics Corporation. Lincoln Park, NJ: Kay Elemetrics Corporation. Multi-Dimensional Voice Program (MDVP) Model 5105: Software Instruction Manual. 1999: 75-83.
111.
Goodglass H, Kaplan E. Philadelphia, Pa: Lea & Febiger; 1972. The Assessment of Aphasia and Related Disorders.
112.
Newcomer PL, Hammill DD. Austin, Tex: Pro-Ed; 1977. Test of Language Development.
113.
Newcomer PL, Hammill DD. Austin, Tex: Pro-Ed; 1982. Test of Language Development-Primary.
114.
Newcomer PL, Hammill DD. 2nd ed. Austin, Tex: Pro-Ed; 1988. Test of Language Development-Primary.
115.
Zimmerman IL, Steiner VG, Pond RE. Columbus, Ohio: Merrill; 1969. Preschool Language Scale.
116.
Riley GD. A stuttering severity instrument for children and adults. J Speech Hear Disord. 1972; 37: 314-322. [PubMed]
117.
Mayo R, Grant WC. Fundamental frequency, perturbation, and vocal tract resonance characteristics in African-American and white American males. ECHO. 1995; 17
118.
Baken RJ, Orlikoff RF. 2nd ed. San Diego, Calif: Singular Publishing Group; 2000. Clinical Measurement of Speech and Voice.
119.
Mayo R, Watkins TR. A Cross-Linguistic Study of Female RFo Characteristics. Presented March 24, 2000: 46th Annual Convention of the North Carolina Speech, Hearing and Language Association: Raleigh, NC.
120.
Harris RP, Helfand M, Woolf SH, et al. Current methods of the US Preventive Services Task Force: A review of the process. Am J Prev Med. 2001; 20: 21-35.
121.
DiSimoni FG, Keith RL, Holt DL, Darley FL. Practicality of shortening the Porch Index of Communicative Ability. J Speech Hear Res. 1975; 18: 491-497. [PubMed]
122.
DiSimoni FG, Keith RL, Darley FL. Prediction of PICA overall score by short versions of the test. J Speech Hear Res. 1980; 23: 511-516. [PubMed]
123.
Hanson WR, Riege WH, Metter EJ, Inman VW. Factor-derived categories of chronic aphasia. Brain Lang. 1982; 15: 369-380. [PubMed]
124.
Kotsopoulos S, Walker S, Beggs K, Jones B. A clinical and academic outcome study of children attending a day treatment program. Can J Psychiatry. 1996; 41: 371-378. [PubMed]