NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

National Research Council (US) Committee on the Assessment of 21st Century Skills. Assessing 21st Century Skills: Summary of a Workshop. Washington (DC): National Academies Press (US); 2011.


5. Measurement Considerations

The assessments described in Chapters 2 through 4 have been designed for a variety of purposes. Some—such as the assessment components of Operation ARIES!, Packet Tracer, and the scenario-based learning strategy described by Louise Yarnall—are designed primarily for formative purposes. That is, the assessment results are used to adapt instruction so that it best meets learners’ needs. Formative assessments are intended to provide feedback that can be used both by educators and by learners. Educators can use the results to gauge learning, monitor performance, and guide day-to-day instruction. Students can use the results to assist them in identifying their strengths and weaknesses and focusing their studying. A key characteristic of formative assessments is that they are conducted while students are in the process of learning the material.1

Other assessments—referred to as summative—are conducted at the conclusion of a unit of instruction (e.g., course, semester, school year). Summative assessments provide information about students’ mastery of the material after instruction is completed. They are designed to yield information about the status of students’ achievement at a given point in time, and their purpose is primarily to categorize the performance of a student or a system. The PISA problem-solving assessment and the portfolio assessments used at Envision Schools are examples of summative assessments, as are the annual state achievement tests administered for accountability purposes.

All assessments should be designed to be of high quality: to measure the intended constructs, provide useful and accurate information, and meet technical and psychometric standards. For assessments used to make decisions that have an important impact on test takers’ lives, however, these issues are critical. When assessments are used to make high-stakes decisions, such as promotion or retention, high school graduation, college admissions, credentialing, job placement, and the like, they must meet accepted standards to ensure that they are reliable, valid, and fair to all the individuals who take them. A number of assessments used for high-stakes decisions were discussed by workshop presenters, including the Multistate Bar exam used to award certification to lawyers, the situational judgment test used for admitting Belgian students to medical school, the tests of integrity used for hiring job applicants, and some of the assessment center strategies used to make hiring and promotion decisions.

For the workshop, the committee arranged for two presentations to focus on technical measurement issues, particularly as they relate to high-stakes uses of summative assessments. Deirdre Knapp, vice president and director of the assessment, training, and policy studies division with HumRRO, spoke about the fundamentals of developing assessments. Steve Wise, vice-president for research and development with Northwest Evaluation Association, discussed the issues to consider in evaluating the extent to which the assessments validly measure the constructs they are intended to measure. This chapter summarizes their presentations and lays out the steps they highlighted as fundamental for ensuring that the assessments are of high quality and appropriate for their intended uses.2 Where appropriate, the reader is referred to other sources for more in-depth technical information about test development procedures.

Defining the Construct

According to Knapp, assessment development should begin with a “needs analysis.” A needs analysis is a systematic effort to determine exactly what information users want to obtain from the assessment and how they plan to use it. A needs analysis typically relies on information gathered from surveys, focus groups, and other types of discussions with stakeholders. Detailed information about conducting a needs analysis can be found at [August 2011].

Knapp emphasized that it is important to have a clear articulation of the construct to be assessed: that is, the knowledge, skill, and/or behavior the stakeholders would like to have measured. The construct definition helps the test developer to determine how to measure it. She cautioned that for the skills covered in this workshop, developing definitions and operationalizing them in order to produce test items can be challenging. For example, consider the variability in the definitions of critical thinking that Nathan Kuncel presented or the definitions of self-regulation that Rick Hoyle discussed. In order to develop an assessment that meets appropriate technical standards, the definition needs to be detailed and sufficiently precise to support the development of test items. Test development is less challenging when the construct is more concrete and discrete, such as specific subject-matter or job knowledge.

One of the more important issues to consider during the initial development stage, Knapp said, is whether the assessment needs to measure the skill itself or simply illustrate the skill. For instance, if the goal is to measure teamwork skills, is it necessary to observe the test takers actually performing their teamwork skills? Or is it sufficient that they simply answer questions that show they know how to collaborate with others and effectively work as a team? This is one of the issues that should be covered as part of the needs analysis.

Knapp highlighted the importance of considering which aspects of the construct can be measured by a standardized assessment and which aspects cannot. If the construct being assessed is particularly broad and the assessment cannot get at all components of it, what aspects of the construct are the most important to capture? There are always tradeoffs in assessment development, and careful prioritization of the most critical features can help with decision making about the construct. Knapp advised that once these decisions are made and the assessment is designed, the developer should be absolutely clear on which aspects of the construct are captured and which aspects are not.

Along with defining the construct, it is important to identify the context or situation in which the knowledge, skills, or behaviors are to be demonstrated. Identification of the specific way in which the construct is to be demonstrated helps to determine the type of assessment items to be used.

Determining the Item Types

As demonstrated by the examples discussed in Chapters 2, 3, and 4, there are many item types and assessment methods, ranging from relatively straightforward multiple-choice items to more complex simulations and portfolio assessments. Knapp noted that some of the recent innovations in computer-based assessments allow for a variety of “glitzy” options, but she cautioned that while these options may be attractive, they may not be the best way to assess the targeted construct. The primary focus in deciding on the assessment method is to consider the knowledge, skill, and/or behavior that the test developer would like to elicit and then to consider the best—and most cost-effective—way to elicit it.

Knapp discussed two decisions to make with regard to constructing test items: the type of stimulus and the response mode. The stimulus is what is presented to the test taker, the task to which he/she is expected to respond. The stimulus can take a number of different forms such as a brief question, a description of a problem to solve, a prompt, a scenario or case study, or a simulation. The stimulus may be presented orally, on paper, or using technology, such as a computer.

The response mode is the mechanism by which the test taker responds to the item. Response modes might include choosing from among a set of provided options, providing a brief written answer, providing a longer written answer such as an essay, providing an oral answer, performing a task or demonstrating a skill, or assembling a portfolio of materials. Response modes are typically categorized as “selected-response” or “constructed-response,” and constructed-response items are further categorized as “short-answer constructed-response,” “extended-answer constructed-response,” and “performance-based tasks.” Response modes also include behavior checklists, such as those described by Candice Odgers to assess conduct disorders, which may be completed by the test taker or by an observer. The response may be provided orally, on paper, through some type of performance or demonstration, or on a computer.

Knapp explained that choices about the stimulus type and the response mode need to consider the skill to be evaluated, the level of authenticity desired,3 how the assessment results will be used, and practical considerations. If the test is intended to measure knowledge of factual information, a paper-and-pencil test with brief questions and multiple-choice answer options may be all that is needed. If the test is intended to measure more complex skills, such as solving complex, multipart problems, a response mode that requires the examinee to construct an answer is likely to be more useful.

Layered on top of these considerations about the best ways to elicit the targeted skill are practical and technical constraints. Test questions that use selected-response or short-answer constructed-response modes can usually be scored relatively quickly by machine. Test questions that use extended-answer constructed-response or performance-based tasks are more complicated to score. Some may be scored by machine, by programming the scoring criteria, but humans may need to score others. Scoring by humans is usually more expensive than scoring by machine, takes longer, and introduces subjectivity into the scoring process. Furthermore, constructed-response and performance-based tasks take longer to answer, and fewer can be included on a single test administration. They are more resource-intensive to develop and try out, and they usually present some challenges in meeting accepted measurement standards for technical quality. These practical and technical constraints are discussed in more detail below.

Test Administration Issues

How will the assessment be administered to test takers? Where will it be administered? When will it be administered and how often? Who will administer it? There are numerous options for how the test may be delivered to examinees and how they respond to it. Choosing among these options requires consideration of practical constraints.

A small assessment program, with relatively few examinees and infrequent administrations, has many options for administration, Knapp advised. For example, performance-based tasks that involve role playing or live performances, or that are administered one-on-one (one test administrator to one examinee), are much more practical when the examinee volume is small and test administrations are infrequent. When the examinee volume is large, performance-based tasks may be impractical because of the resources they require. The resources required for performance-based tasks can be reduced if they can be presented and responded to via computer, particularly if the scoring can be programmed as well.

Despite the resources required, several currently operating large standardized testing programs make use of performance-based tasks. As described in Chapter 2, the Multistate Bar Exam includes a performance-based component with a written response and is administered to approximately 40,000 candidates each year. Test takers pay $20 to take this assessment.

Another example is Step 2 of the United States Medical Licensing Examination (USMLE), which includes a performance-based component. The Clinical Skills portion of the exam evaluates medical students’ ability to gather information from patients, perform physical examinations, and communicate their findings to patients and colleagues. The assessment uses standardized patients to accomplish this. Standardized patients are humans who are trained to pose as patients. They are trained in how they should respond to the examinee’s questions in order to portray certain symptoms and/or diseases, and they are trained to rate the examinee’s skills in taking histories from patients with certain symptoms. (For more information, see [August 2011].) Approximately 33,300 individuals took the exam between July 1, 2009, and June 30, 2010 (see [August 2011]). This exam is expensive for test takers; the fee to take the test is $1,100.

A third example is the portfolio component of the assessment used to award advanced level certification for teachers by the National Board for Professional Teaching Standards (NBPTS). This assessment evaluates teachers’ ability to think critically and reflect on their teaching and requires that teachers assemble a portfolio of written materials and videotapes. Approximately 20,000 teachers take the assessment each year (Mary Dilworth, vice president for research and higher education with the NBPTS, personal communication, May 31, 2011), and scores are available within 6 to 7 months (see [August 2011]). This assessment is also expensive; examinees pay $2,500 to sit for the exam.


Knapp noted that if the choice is to use extended constructed-response or performance-based tasks, decisions must be made about how to score them. These types of open-ended responses may be scored dichotomously or polytomously. Dichotomous scoring means the answer is scored either correct or incorrect. Polytomous scoring means a graded scale is used, and points are awarded depending on the quality of the response or the presence of certain attributes in the response. Either way, a scoring guide, or “rubric,” must be developed to establish the criteria for earning a certain score. The scoring criteria may be programmed so a computer does the scoring, or humans may be trained to do the scoring. When humans do the scoring, substantial time must be spent on training them to apply the scoring criteria appropriately. Since scoring constructed-response and performance-based tasks requires that scorers make judgments about the quality of the answer, the scorers need detailed instructions on how to make these judgments systematically and in accord with the rubric. Likewise, when constructed-response items are scored by computer, the computer must be “trained” to score the responses correctly, and the accuracy of this scoring must be closely monitored.
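The distinction between dichotomous and polytomous scoring can be sketched in a few lines of Python. The rubric criteria, point values, and pass threshold below are hypothetical, invented purely for illustration:

```python
# Hypothetical rubric for an extended constructed-response item:
# each criterion the response satisfies earns the attached points.
RUBRIC = {
    "states a claim": 1,
    "cites supporting evidence": 2,
    "addresses a counterargument": 1,
}

def polytomous_score(satisfied_criteria, rubric=RUBRIC):
    """Graded (polytomous) scoring: total points across satisfied criteria."""
    return sum(points for criterion, points in rubric.items()
               if criterion in satisfied_criteria)

def dichotomous_score(satisfied_criteria, rubric=RUBRIC, threshold=3):
    """Correct/incorrect (dichotomous) scoring derived from the same rubric."""
    return 1 if polytomous_score(satisfied_criteria, rubric) >= threshold else 0
```

In practice, of course, the hard part is the human or machine judgment that decides which criteria a response satisfies; the arithmetic above is the easy step that follows.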

For some purposes, it is useful to set “performance standards” for the assessment. This might mean determining the level of performance considered acceptable to pass the assessment. Or it may mean classifying performance into three or more categories, such as “basic,” “proficient,” and “advanced.” Making these kinds of performance-standards decisions requires implementing a process called “standard setting.” For further information about setting standards, see Cizek and Bunch (2007) or Zieky, Perie, and Livingston (2008).

Technical Measurement Standards

Any assessment used to make important decisions about the test takers should meet certain technical measurement standards. These technical guidelines are laid out in documents such as the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999), hereafter referred to as the Standards. Knapp and Wise focused on three critical technical qualities particularly relevant for assessments of the kinds of skills covered in the workshop, given the challenges in developing these assessments: reliability, validity, and fairness.


Reliability

Reliability refers to the extent to which an examinee’s test score reflects his or her true proficiency with regard to the trait being measured. The concern of reliability is the precision of test scores, and, as explained in more detail later in this section, the level of precision needed depends on the intended uses of the scores and the consequences associated with these uses (see also American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, pp. 29–30).

Reliability is evaluated empirically using the test data, and several different strategies can be used for collecting the data needed to calculate the estimate of reliability. One strategy involves administering the same form4 of the test or parallel forms of the test to the same group of examinees at independent testing sessions. When multiple administrations are impractical or unavailable, an alternative strategy involves estimating reliability from a single test form given on a single occasion. For this type of reliability estimate, the test form is divided into two or more constituent parts, and the consistency across these parts is determined using an estimate such as coefficient alpha or a split-half reliability coefficient. Each of these strategies for estimating reliability examines the precision of scores in relation to specific sources of error. Additional information about estimating reliability is available in Haertel (2006) and Traub (1994).

For tests that are scored by humans, another type of reliability information is commonly reported. When humans score examinee responses, they must make subjective judgments based on comparing the scoring guide and criteria to a particular test taker’s performance. This introduces the possibility of scoring error associated with human judgment, and it is important to estimate the impact of this source of error on test scores. One estimate of reliability when human scoring is used is “inter-rater agreement,” which is obtained by having two raters score each response and calculating the correlation between these scores. Knapp indicated that an estimate of inter-rater agreement provides basic reliability information, but she cautioned that it is not the only type of reliability evidence that should be collected when responses are scored by humans. A more complete data collection strategy involves generalizability analysis, which can be designed to examine the precision of test scores in relation to multiple sources of error, such as testing occasion, test form, and rater. Additional information about generalizability analysis is available in Shavelson and Webb (1991).
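The correlation-based agreement estimate Knapp described is straightforward to compute. With invented ratings for eight responses on a 1–5 rubric scale, it looks like this (a stricter exact-agreement rate is shown alongside for comparison):

```python
import numpy as np

# Invented scores from two trained raters on the same eight responses.
rater_a = np.array([4, 3, 5, 2, 4, 3, 5, 1])
rater_b = np.array([4, 2, 5, 2, 3, 3, 4, 1])

# Correlation-based inter-rater agreement.
agreement = np.corrcoef(rater_a, rater_b)[0, 1]

# A stricter companion statistic: the share of exact score matches.
exact_match_rate = (rater_a == rater_b).mean()
```

Note that the two statistics answer different questions: these invented raters correlate highly (about .93) yet assign the identical score only five times out of eight, which is one reason a single agreement index gives an incomplete picture.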

Reliability is typically reported as a coefficient that ranges from 0 to 1. The level of reliability needed depends on the nature of the test and the intended use of the scores: there are no absolute levels of reliability that are considered acceptable. When test results are used for high-stakes purposes, such as with a high school exit exam, reliability coefficients in the range of .90 or higher are typically expected. Lower reliability coefficients may be acceptable for tests used for lower stakes purposes, such as to determine next steps for instruction.

Generally, all else being equal, the more items on a test, the higher the reliability. This is because longer tests obtain a more extensive sample of the knowledge, skills, and behaviors being assessed than do shorter tests. Tests that rely on open-ended questions, such as extended-answer constructed-response and performance-based tasks, tend to consist of fewer items because these types of questions take more time to answer than do multiple-choice items. For practical reasons, such as the amount of testing time available, and because of concerns about examinee fatigue, tests can only include a limited number of these types of questions. Thus, tests that make use of open-ended questions tend to be less reliable than tests that primarily use multiple-choice questions, in part, because they contain fewer test questions. In addition, tests that require that judgments be made about the quality of the response—either when humans do the scoring or when scoring is done by artificial intelligence—introduce error associated with these judgments, which also tends to reduce reliability levels. Knapp advised that these factors should be considered in relation to the interpretations and uses of test scores in making decisions about the types of questions used on the test.

Two other measures of score precision to consider are the standard error of measurement and classification consistency. The standard error of measurement provides an estimate of precision that is on the same scale as the test scores (i.e., as opposed to the 0 to 1 scale of a reliability coefficient). The standard error of measurement can be used to calculate a confidence band for an individual’s test score. Additional information on standard errors of measurement and confidence bands can be found in Anastasi (1988, pp. 133–137), Crocker and Algina (1986, pp. 122–124), and Popham (2000, pp. 135–138), and the Standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, pp. 28–31).
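Under classical test theory, the standard error of measurement is commonly computed from the score-scale standard deviation and the reliability coefficient. The scale values below are invented for illustration:

```python
import math

def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: score-scale imprecision implied by a reliability coefficient."""
    return score_sd * math.sqrt(1 - reliability)

def confidence_band(observed_score, score_sd, reliability, z=1.96):
    """Approximate 95% band around an individual's observed score."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return (observed_score - z * sem, observed_score + z * sem)

# Invented example: a scale SD of 15 and reliability of .91 give an SEM
# of 4.5 score points, so an observed score of 100 carries a 95% band of
# roughly 91 to 109.
band = confidence_band(100, 15, 0.91)
```

Expressing precision as a band of score points, rather than as a coefficient between 0 and 1, is often easier for score users to interpret.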

The third measure of precision—classification consistency—is most relevant when tests are used to classify the test takers into performance categories, such as “basic,” “proficient,” or “advanced,” or simply as “proficient” or “not proficient,” or “pass” and “fail.” When important consequences are tied to test results, classification consistency should be examined. Classification consistency estimates the proportion of test takers who would be placed in the same category upon repeated administrations of the test. In this case, the issue is the precision of measurement near the cut score (the score used to classify test takers into the performance categories). Additional information about classification consistency can be found in the Standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p. 30).
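A direct (if data-hungry) way to estimate classification consistency is to compare category placements across two administrations. The scores and cut score below are invented:

```python
def classification_consistency(scores_first, scores_second, cut_score):
    """Proportion of examinees placed in the same pass/fail category
    on two administrations, given a common cut score."""
    same = sum((a >= cut_score) == (b >= cut_score)
               for a, b in zip(scores_first, scores_second))
    return same / len(scores_first)

# Invented scores for six examinees on two administrations; cut score 60.
consistency = classification_consistency(
    [72, 65, 80, 58, 90, 61],
    [70, 68, 79, 55, 88, 59],
    cut_score=60,
)
```

In this invented example five of six examinees land in the same category both times; the one inconsistency comes from the examinee scoring 61 and then 59, which illustrates why precision near the cut score is the central concern. In operational practice, repeated administrations are rarely available, so classification consistency is usually estimated from a single administration using model-based methods.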

It is important to point out that for some of the more innovative assessments, these measures of precision cannot be estimated. As Knapp put it, “computer-based technology has gotten way ahead of the capabilities of psychometric tools.” For example, at present there is no practical way to estimate reliability for some of the computerized assessments, such as those that are part of Operation ARIES! or Packet Tracer.


Validity

Validity refers to the extent to which the assessment scores measure the skills that they purport to measure. As Steve Wise framed it, validity refers to the “trustworthiness of the scores as being true representations of a student’s proficiency in the construct being assessed.” Validation involves the evaluation of the proposed interpretations and uses of the test results. Validity is evaluated based on evidence—both rational and empirical, qualitative and quantitative. This includes evidence based on the processes and theory used to design and develop the test as well as a variety of kinds of empirical evidence, such as analyses of the internal structure of the test, analyses of the relationships between test results and other outcome measures, and other studies designed to evaluate the extent to which the intended interpretations of test results are justifiable and appropriate. Wise and Knapp both emphasized that evaluation of validity and collection of validity evidence is a continuing, ongoing process that should be regularly conducted as part of the testing program. See Messick (1989) and Kane (2006) for further information about validation.

Wise noted that many factors can affect the trustworthiness of the scores, but two are particularly relevant for the issues raised in the workshop: motivation to perform well and construct irrelevant variance. One of the most important influences on motivation to perform well is the ways in which the scores are used—the interpretations made of them, the decisions about actions to take based on those interpretations, and the consequences (or stakes) attached to these decisions. When the stakes are high, Wise explained, the incentive to perform well is strong. The more important the consequences attached to the test results, the higher the motivation to do well. Motivation to perform well is critical, Wise stressed, in obtaining test results that are trustworthy as true representations of a student’s proficiency with the construct. If the test results do not matter or do not carry consequences for students, they may not try their best, and the test results may be a poor representation of their proficiency level.

Motivation to do well can also bring about perverse behaviors, Wise cautioned. When test results have important consequences for students, examinees may take a number of actions to improve their chances of doing well—some appropriate and some inappropriate. For example, some students may study extra hard and spend long hours preparing. Others may find inappropriate short cuts that work to invalidate the test results, such as finding out the test questions beforehand, copying from another test taker, or bringing disallowed materials, such as study notes, into the test administration. These types of behaviors can produce scores that are not accurate representations of the students’ true skills.

For the kinds of skills discussed at this workshop, motivation to do well can introduce a second source of error, which Wise described as “fake-ability.” Some of the constructs have clearly socially acceptable responses. For example, if the assessment is designed to measure constructs such as adaptability, teamwork, or integrity, examinees may be able to figure out the desired response and respond in the socially acceptable way, regardless of whether it is a true representation of their attitudes or behaviors. Another concern with these kinds of items is that they may be particularly “coachable.” That is, those who are helping a test taker prepare for the assessment can teach the candidate strategies for scoring high on the assessment without having taught the candidate the skill or construct being assessed. Thus, the score may be influenced more by the candidate’s skill in test taking strategies than his or her proficiency on the construct of interest.

A related issue is construct irrelevant variance. Problems with construct irrelevant variance occur when something about the test questions or administration procedures interferes with examinees’ ability to demonstrate the intended construct. For instance, if an assessment of teamwork is presented in English to students who are not fluent in English, the assessment will measure comprehension of English as well as teamwork skills. This may be acceptable if the test is intended to be an assessment of teamwork skills in English. If not, it will be impossible to obtain a precise estimate of the examinee’s ability on the intended construct because another factor (facility with English) will interfere with demonstration of the true skill level. This can be a particular concern with some of the more innovative item types, such as those that are computer based or involve strategies such as simulations or role-playing, Wise noted. If familiarity with the item type or assessment strategy gives students an advantage that is not related to the construct, the assessment will give a flawed portrayal of the examinee’s skills. This influences the validity of the inferences being made about the test scores.


Fairness

Fairness in testing means the assessment should be designed so that test takers can demonstrate their proficiency on the targeted skill without irrelevant factors interfering with their performance. As such, fairness is an essential component of validity. Many attributes of test items can contribute to construct irrelevant variance, as described above, and thus require skills that are not the focus of the assessment. For instance, suppose an assessment is intended to measure skill in mathematical problem solving, but the test items are presented as word problems. Besides assessing math skill, the items also require a certain level of reading skill. Examinees who do not have sufficient reading skills will not be able to read the items and thus will not be able to accurately demonstrate their proficiency in mathematical problem solving. Likewise, if the word problems are in English, examinees who do not have sufficient command of the English language will not be able to demonstrate their proficiency in the math skills that are the focus of the test.

Additional considerations about fairness may arise in relation to cultural, racial, and gender issues. Test items should be written so that they do not in some way disadvantage the test taker based on his or her racial/ethnic identification or gender. For example, if the math word problem discussed above uses an example more familiar or accessible to boys than girls (e.g., an example drawn from sports), it may give the boys an unfair advantage. The same may happen if the example is more familiar to students from a white Anglo-Saxon culture than to racial/ethnic minority students. Many of the skills covered in the workshop present considerable challenges with regard to fairness. For example, cultural issues may cause differential performance on assessments of skills in communication, collaboration, or other interpersonal characteristics. Social inequities related to income, family background, and home environment may also cause differential performance on assessments. Students may not have equal opportunities to learn these skills.

The measurement field has a number of ways to evaluate fairness with assessments. The Standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, pp. 71–106) provides a more complete discussion.

The Relationship Between Test Uses and Technical Qualities

Knapp and Wise both emphasized that when test results are used for summative purposes and high-stakes decisions are based on the results, the tests are expected to meet high technical standards to ensure decisions are based on accurate and fair information. For example, if a test is used for pass/fail decisions to determine who graduates from high school and who does not, the measurement accuracy of the scores needs to be high. Meeting high technical standards can be challenging and expensive because it requires a number of actions to be taken during the test development, administration, and scoring stages. For example, when tests are used for high-stakes purposes, reliability and classification consistency should be high. Test items will need to be kept secure. They cannot be reused multiple times because students remember them and pass the information on to others. Having to continually replenish the item pool is expensive and resource intensive, and it requires developing multiple forms of the test.

If different forms of the test are used, efforts have to be made to create test forms that are as comparable as possible. When tests are comprised of selected-response items or short constructed-response items, quantitative methods can be used to ensure that the scores from different test forms are equivalent. Statistical procedures—referred to as “equating” or “linking”—can be used to put the scores from different forms on the same scale and achieve this equivalence. For a number of reasons, linking or equating is usually not possible when tests are comprised solely of extended constructed-response items. In this situation, there is no straightforward way to ensure that the test forms are strictly comparable and test scores equivalent across different forms. See Kolen and Brennan (2004) or Holland and Dorans (2006) for additional explanation of linking.
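To make the equating idea concrete, one of the simplest procedures is linear equating, which maps form-Y scores onto the form-X scale by matching means and standard deviations. The score distributions below are invented and deliberately tiny; operational programs use larger samples and more sophisticated designs (e.g., equipercentile or item-response-theory methods):

```python
import statistics

def linear_equate(y_score, form_x_scores, form_y_scores):
    """Place a form-Y score on the form-X scale by matching means and SDs."""
    mean_x = statistics.mean(form_x_scores)
    mean_y = statistics.mean(form_y_scores)
    sd_x = statistics.pstdev(form_x_scores)
    sd_y = statistics.pstdev(form_y_scores)
    return mean_x + sd_x * (y_score - mean_y) / sd_y

# Invented distributions: form X has mean 50 / SD 10, while the (harder)
# form Y has mean 45 / SD 5. A form-Y score of 50 equates to 60 on form X.
form_x = [40, 60]
form_y = [40, 50]
equated = linear_equate(50, form_x, form_y)
```

Under these invented distributions, matching the means and SDs credits form-Y examinees for taking the harder form: a score one SD above the form-Y mean is mapped to one SD above the form-X mean.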

Thus, the test developer is often faced with a number of dilemmas. Constructed-response and performance-based tasks may be the most authentic way to assess 21st century skills. However, achieving high technical standards with these item types is challenging. When tests do not meet high technical standards, the results should not be used for high-stakes decisions with important consequences for students. But, when the results do not impact students’ lives in important ways (i.e., “they do not count”), students may not try their best. Raising the stakes means increasing the technical quality of the tests. Test developers must face these issues and set priorities as to the most important aspects of the assessment. Is it more important to have authentic test items or to meet high reliability standards? Test developers are often faced with competing priorities and will need to make tradeoffs. Decisions about these tradeoffs will need to be guided by the goals and purposes of the assessment as well as practical constraints, such as the resources available.



1. The reader is referred to Andrade and Cizek (2010) for further information about formative assessment and the difference between formative and summative assessment.


2. Knapp’s presentation is available at http://www7.national-academies.org/bota/21st_Century_Workshop_Knapp.pdf [August 2011]. Wise’s presentation is available at http://www7.national-academies.org/bota/21st_Century_Workshop_Wise.pdf [August 2011].


3. Authenticity refers to how closely the assessment task resembles the real-life situation in which the test taker is required to use the skill being assessed. As described earlier, the level of authenticity desired is an issue that should be addressed as part of a needs analysis.


4. A form is the specific collection of items or tasks that are included on the test.

Copyright © 2011, National Academy of Sciences.
Bookshelf ID: NBK84220

