NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

National Research Council (US) Board on Science Education. Exploring the Intersection of Science Education and 21st Century Skills: A Workshop Summary. Washington (DC): National Academies Press (US); 2010.


7 Assessment of 21st Century Skills

This chapter summarizes two presentations and a discussion of the assessment of 21st century skills. The first section of the chapter focuses on methods used by some large corporations to assess the 21st century skills of current employees and job applicants. The second section summarizes a commissioned paper focusing on assessment of 21st century skills in educational settings. The final section of the chapter summarizes discussion of the presentation and paper.

Assessment in the Corporate World

Janis Houston opened her presentation (Houston and Cochran, 2009) by highlighting the purposes of corporate assessment. She noted that the purposes for which any test is used (whether in education or in making employment decisions) affect the methods used in developing the test. In the world of employment, assessments are used for selection (to try to predict whether an individual will perform well in the future), for promotion, for certification (to ensure that an individual possesses a certain standard body of knowledge), and to identify training and development needs.

Houston then outlined the types of assessments used by large organizations, which include multiple-choice tests of cognitive abilities (e.g., mathematics) and noncognitive characteristics (e.g., personality type), structured interviews, situational judgment tests, role plays, group exercises, in-basket exercises, work samples, and performance standards/appraisal. She discussed three of these types in greater depth. All three are designed to assess a candidate’s readiness for a specific job, which may include assessment of 21st century skills if such skills are important for successful performance in the job.

Role Play

In a role play assessment, the candidate for promotion is provided with written information about a realistic situation that may involve a nonroutine problem. After a period of time to prepare for the role play, the candidate presents her or his response to the situation to a panel of trained assessors. The assessors rate the response using behaviorally anchored rating scales, which describe specific behaviors.

For example, a candidate for promotion may be asked to play the role of a newly promoted insurance investigator who has just been put in charge of a large-scale insurance fraud investigation and is about to meet with a claims adjuster and an FBI agent. The candidate is told that it is important to ensure that the investigation is led by his or her company, rather than the FBI, and is given an hour to prepare for the 30-minute meeting. The managers conducting the assessment also prepare another person to participate in the role play, in the role of the FBI agent. This person is instructed to be very aggressive, to interrupt the candidate frequently, to push the candidate to turn the case completely over to the FBI, and to provide the candidate with new information about the criminal past of the suspects.

Houston explained that trained assessors use a scoring system to evaluate the candidate’s adaptability, as displayed in the role play. This system guides assessors to award few points for adaptability if the candidate acts flustered or overwhelmed by new information and more points if the candidate seamlessly adjusts to new information.
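The anchors and point values in the following sketch are hypothetical (Houston did not present her rubric in this detail), but they illustrate how a behaviorally anchored rating scale can be represented and how ratings from multiple trained assessors might be combined:

```python
# Hypothetical sketch of a behaviorally anchored rating scale (BARS)
# for adaptability.  The anchors and point values are invented for
# illustration, not taken from Houston's actual scoring system.

ADAPTABILITY_ANCHORS = {
    1: "Acts flustered or overwhelmed when given new information",
    3: "Pauses noticeably but eventually incorporates new information",
    5: "Seamlessly adjusts plans and responses to new information",
}

def score_candidate(assessor_ratings):
    """Average the independent ratings given by multiple assessors."""
    if not assessor_ratings:
        raise ValueError("at least one assessor rating is required")
    for r in assessor_ratings:
        if r not in ADAPTABILITY_ANCHORS:
            raise ValueError(f"rating {r} has no behavioral anchor")
    return sum(assessor_ratings) / len(assessor_ratings)

# Three assessors observe the same role play and rate independently.
print(round(score_candidate([5, 3, 5]), 2))  # -> 4.33
```

Tying each point value to a concrete behavior is what distinguishes a BARS from a generic numeric scale: assessors agree in advance on what a "5" looks like, which is why multiple trained raters can score the same performance consistently.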

Developing and administering role plays involves several challenges, Houston said. First, substantial input from subject-matter experts is necessary to identify appropriate problems or situations, and personnel testing experts are also needed to create the role play materials and behaviorally anchored ratings, so the process is labor-intensive. Second, because only one candidate can participate at a time, this form of assessment is expensive. Many role plays involve additional role players along with the candidate, and these other role players must be paid for their time. Finally, a role play can be scored only by engaging the time and expertise of multiple trained assessors. Unlike a typical multiple-choice test used in educational settings, a role play cannot be electronically scored.

Group Exercise

The development, administration, and scoring of a group exercise are similar to the role play, Houston said. The critical difference is that candidates work in groups to address a problem or respond to a situation, making it possible to assess their interactive skills, such as negotiation, persuasion, and teamwork.

Houston said that the costs and challenges involved in this form of assessment are similar to those of a role play, requiring the time of subject-matter experts and test developers. The exercise must be carefully designed to manage the group interaction, and it often requires the time of other trained role players in addition to the candidates. Finally, scoring the group exercise requires the participation of multiple trained assessors; it cannot be electronically scored.

Houston noted that corporations are willing to pay the high costs ($250,000–$500,000) of creating role plays and group exercises, because these forms of assessment are well accepted by candidates, and they yield more information about a candidate’s skills than a multiple-choice written test. These two forms of assessment, she said, help organizations to hire more highly qualified candidates, resulting in increased productivity and job performance.

Situational Judgment Tests

Houston said that a situational judgment test is less expensive than either a role play or a group exercise. In this test, the candidate is presented with a realistic hypothetical situation and a list of five to eight possible responses to the situation. The candidate may be asked to select the most and least effective options or only the most effective option. The test is scored according to the effectiveness of the options the candidate selects.

Houston presented an example of a situational judgment test designed to assess adaptability. The candidate is presented with a situation in which he or she is working intensely to finish a report for a supervisor, when that supervisor calls to say he or she would like to touch base in one hour about the candidate’s progress on another project. The response options range from getting as much of the report done as possible within the hour to explaining to the supervisor that the report is unfinished and asking for an extension on its deadline.

Houston said this type of test is much less expensive to administer than either a role play or a group exercise. The test is developed by identifying the skills to be assessed, creating realistic situations or problems in consultation with subject-matter experts, generating multiple response options for each situation, and devising a scoring system to determine the effectiveness of each response option. The test situations can be presented either in print or by video, and the candidate responds in writing. Like educational assessments, the situational judgment test can be administered to a large group simultaneously and can be electronically scored.
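As an illustration of how such a scoring system might work, the following sketch assigns hypothetical expert effectiveness values to response options and keys the item score to the candidate’s "most effective" and "least effective" choices. The option labels, descriptions, and values are invented, not taken from Houston’s example:

```python
# Hypothetical sketch of situational judgment test scoring.  Each
# response option carries an effectiveness value set in advance by
# subject-matter experts; these particular options and values are
# invented for illustration.

OPTION_EFFECTIVENESS = {
    "A": 4,  # finish as much of the report as possible within the hour
    "B": 2,  # keep working on the report and ignore the meeting request
    "C": 5,  # reprioritize, then brief the supervisor on both projects
    "D": 1,  # arrive at the meeting with nothing prepared
}

def score_item(most_effective, least_effective):
    """One common keying: credit for matching the expert-rated best
    option as 'most effective' and the worst option as 'least'."""
    best = max(OPTION_EFFECTIVENESS, key=OPTION_EFFECTIVENESS.get)
    worst = min(OPTION_EFFECTIVENESS, key=OPTION_EFFECTIVENESS.get)
    return int(most_effective == best) + int(least_effective == worst)

print(score_item("C", "D"))  # -> 2
print(score_item("A", "B"))  # -> 0
```

Because the key is fixed ahead of time, every response can be scored mechanically, which is what makes this format cheap to administer at scale relative to role plays and group exercises.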


Houston concluded by saying that developing and administering all three types of corporate assessments is a labor-intensive, expensive process that may not be practical for large-scale educational assessment. However, relative to the role play and the group exercise, situational judgment tests are far easier to administer and score; they can be administered to large groups and scored by computer.

Assessment in Educational Settings

Maria Ruiz-Primo (University of Colorado Laboratory for Educational Assessment, Research, and Innovation) presented a summary of her paper on assessment of 21st century skills (Ruiz-Primo, 2009).

Model of Assessment Development

Ruiz-Primo observed that her approach to developing a framework to assess 21st century skills in the context of science is based on a theoretical model for the development and evaluation of assessments, which takes the form of a square. The first step in the model is to define the construct—the knowledge, skills, or other attributes to be assessed. Based on this definition, the developers use conceptual analysis to identify behaviors, responses, or activities most representative of the construct in order to create an observation model. Next, the developers use the observation model as the basis for developing the assessment, with specific situations designed to elicit the behaviors, responses, or activities included in the observation model. After administering the assessment, the developers conduct empirical analysis to interpret the results and analyze whether the evidence collected supports the inferences about the knowledge, skills, or other attributes of the construct. The result of this analysis may lead to revision of the construct, the observation model, or the assessment itself.

Defining the Construct

Ruiz-Primo began by defining the construct of 21st century skills in the context of science. She identified dimensions of the five skills and the research underlying each dimension. She then compared these dimensions with three other recent models: (1) the framework developed by the Partnership for 21st Century Skills (2009b); (2) the Standards for the 21st Century Learner of the American Association of School Librarians (2009); and (3) the enGauge 21st Century Skills developed by the North Central Regional Education Laboratory and the Metiri Group (Lemke et al., 2003). She observed that the other three models most strongly emphasized communication and nonroutine problem-solving skills.

Turning to the context of science education, Ruiz-Primo compared dimensions of the five skills with the definition of science proficiency developed in a recent Board on Science Education review of the research on science learning in grades K-8 (National Research Council, 2007a). The review concluded that students who are proficient in science should:

  1. know, use, and interpret scientific explanations of the natural world;
  2. generate and evaluate scientific evidence and explanations;
  3. understand the nature and development of scientific knowledge; and
  4. participate productively in scientific practices and discourse.

Ruiz-Primo found that dimensions of two of the five 21st century skills—complex communication/social skills and nonroutine problem solving—were most closely aligned with this definition of science proficiency.

Proposed Construct

Building on these two analyses, Ruiz-Primo proposed a construct of 21st century skills in the context of science that includes three domains:

  1. Dispositions (general inclinations or attitudes of mind);
  2. Cross-functional skills (cognitive skills that are likely to be used in any domain); and
  3. Science knowledge.

She included two of the 21st century skills—adaptability and self-management/self-development—in the domain of dispositions, and two others—complex communication/social skills and nonroutine problem-solving skills—in the domain of cross-functional skills. The science knowledge domain is defined by the four strands of science proficiency listed above (National Research Council, 2007a). The resulting construct is represented in Figure 7-1.

FIGURE 7-1. Construct domains of 21st century skills in the context of science education. SOURCE: Ruiz-Primo (2009).

Types of Science Knowledge

Expanding her analysis of science content, Ruiz-Primo explained that her research group has proposed an approach to understanding and developing measures of science achievement based on the idea of types of knowledge (Li, 2001; Li, Ruiz-Primo, and Shavelson, 2006; Ruiz-Primo, 1997, 1998, 2003; Shavelson and Ruiz-Primo, 1999). They include

  • Declarative knowledge: knowing that. This type ranges from discrete and isolated content elements, such as terminology, facts, or specific details, to more organized forms of knowledge, such as statements, definitions, and knowledge of classifications and categories.
  • Procedural knowledge: knowing how. This type involves knowledge of skills, algorithms, techniques, and methods. It usually takes the form of if-then production rules or a sequence of steps (e.g., measuring temperature using a thermometer; applying an algorithm to balance chemical equations; adding, subtracting, multiplying, and dividing whole numbers).
  • Schematic knowledge: knowing why. This type involves more organized bodies of knowledge, such as schemas, mental models, or “theories” (implicit or explicit) that are used to organize information in an interconnected and systematic manner.
  • Strategic knowledge: knowing when, where, and how to apply knowledge. “The application of strategic knowledge involves navigating the problem, planning, monitoring, trouble-shooting, and synchronizing other types of knowledge. Typically, strategic knowledge is used when one encounters ill defined tasks” (Li and Tsai, 2007, p. 14).

Ruiz-Primo proposed that this typology of knowledge has three important implications for science assessment. First, it can be applied to determine what types of knowledge are being measured by a particular assessment; second, it can be used to interpret student scores; and third, it can be applied to design or select assessment tasks that are aligned with instructional goals.

An Approach to Developing and Evaluating Assessments

Ruiz-Primo explained that defining the construct represents the first step in developing and evaluating an assessment. The next steps include development of observation models, specifying those aspects of a student’s response to a test item that would be valued as evidence of the construct, and linking the observation models with the design of the assessment.

Retrospective Logical Analysis

Assessment researchers use retrospective logical analysis to analyze assessment tasks that have already been developed. In this type of analysis, they review how the task elicits the targeted knowledge and influences students’ thinking and responses. Ruiz-Primo identified four criteria that can be applied in retrospective logical analysis:

  1. Task demands: what students are asked to perform (e.g., define a concept or provide an explanation);
  2. Cognitive demands: the inferred cognitive processes that students may draw on to produce responses (e.g., recall a fact or reason with a model);
  3. Item openness: the extent of constraints in the response (e.g., selecting versus generating responses or requiring information only found in a task versus information that can be learned from the task); and
  4. Complexity of the item: the diverse characteristics of an item, such as familiarity to students, reading difficulty, and the extent to which it reflects experiences that are common to all students.

Returning to the goal of assessing complex communication and nonroutine problem solving, Ruiz-Primo said that the assessment items should be designed to yield evidence of the solving of complex problems. She then identified several dimensions related to the complexity of a problem (see Table 7-1). One is the structure of a problem, which is determined by the test developer. A problem may be well structured or ill structured. Another dimension is whether the problem is routine or nonroutine, and this depends on whether the examinee has already learned procedures to solve this type of problem. In another dimension, a “rich” problem requires activities and subtasks, while a lean problem does not. Yet another dimension is the extent to which it requires prior exposure to the topic in the context of school. Other dimensions include whether the solution is approached individually or in collaboration with others and whether there is a time constraint to solve the problem. Finally, she observed that a problem may vary in the extent of communication required to respond, with a selected or short-answer item requiring less writing and a constructed-response item requiring more writing. Ruiz-Primo integrated all of these dimensions into a framework for retrospective analysis of existing assessment tasks, which she applied to review sample assessment tasks.

TABLE 7-1. Dimensions of Problem Complexity.


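One way to operationalize these dimensions for retrospective analysis is as a simple structured record. The field names below paraphrase the dimensions in Table 7-1 (they are my shorthand, not Ruiz-Primo’s labels), and the sample values follow her analysis of the CLA analytic writing task:

```python
# Sketch of the problem-complexity dimensions (Table 7-1) as a record
# for retrospective logical analysis of existing assessment tasks.
# Field names paraphrase the dimensions; the sample values follow
# Ruiz-Primo's analysis of the CLA analytic writing task.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskAnalysis:
    well_structured: bool        # can students tell what an acceptable answer is?
    routine: Optional[bool]      # None = depends on the examinee's prior experience
    rich: bool                   # requires subtasks or multiple activities
    needs_school_exposure: bool  # requires procedures taught in school
    collaborative: bool          # solved with others or individually
    timed: bool                  # is there a time constraint?
    writing_demand: str          # "none", "short", or "extended"

cla_writing_task = TaskAnalysis(
    well_structured=False, routine=None, rich=True,
    needs_school_exposure=False, collaborative=False,
    timed=True, writing_demand="extended",
)
print(cla_writing_task.writing_demand)  # -> extended
```

Coding every candidate task against the same fixed set of dimensions is what lets a reviewer compare tasks from different sources and judge which ones are complex enough to elicit nonroutine problem solving.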

Review of Sample Assessment Tasks

Ruiz-Primo explained that she reviewed several existing assessment items to consider their ability to measure adaptability, complex communication, nonroutine problem solving, and self-management/self-development. She selected assessment items from science as well as other domains, as requested by the planning committee. Returning to the construct she had proposed, she asked what types of assessment tasks would yield evidence that a student can “generate and evaluate scientific evidence and explanations” (National Research Council, 2007a, p. 2). In answer to her own question, she said that tasks should require students to produce or critique examples of scientific evidence and explanations, and they should involve ill-structured or nonroutine problems. In addition, she proposed that the tasks should be constrained in terms of time allowed or collaborations required, in order to measure not only complex communication and nonroutine problem solving but also adaptability and self-management. Based on all of these considerations, she concluded that assessment methods such as essays and performance assessment tasks were promising candidates for assessing the four 21st century skills. She then reviewed four assessment items, discussed below.

Analytic Writing Task

Ruiz-Primo described a writing task from the Collegiate Learning Assessment (CLA). Developed by the Council for Aid to Education and the RAND Corporation, the goal of the CLA is to measure “value added” by educational programs in colleges and universities. The writing task Ruiz-Primo examined presents a real-life scenario and allows students 30 minutes to construct a written argument for or against the principal’s decision to oppose the opening of fast-food restaurants near the school (see Box 7-1).


BOX 7-1

Sample CLA Analytic Writing Task: Critique an Argument. A well-respected professional journal with a readership that includes elementary school principals recently published the results of a two-year study on childhood obesity. (Obese individuals are (more...)

Ruiz-Primo concluded that this problem is not well structured, increasing its complexity, because students are not able to determine what an acceptable answer would be or how to arrive at that answer. She said it is unclear whether the problem is routine or nonroutine; students being examined may or may not have already learned a routine procedure to approach it. The problem may require subtasks, such as making a list of pros and cons first before writing the argument. It does not appear to require an academic procedure taught at the school. Although CLA administers this type of task individually, it could also be a collaborative task. Because it is timed, students need to self-manage their time to finish it in 30 minutes. The context of the task can be considered “social,” something that students might observe in their own community.

According to the CLA framework, the test measures skills that are applicable to a wide range of academic subjects and are also valued by employers, including critical thinking, analytic reasoning, problem solving, and written communication (Klein et al., in press). This particular item emphasizes written communication. Ruiz-Primo noted that the rubric used to score the item indicates that student performance is evaluated based on dimensions of the four strands of science proficiency (National Research Council, 2007a), including evaluation of evidence, analysis and synthesis of evidence, and drawing conclusions as well as on criteria related to the quality of the writing.

Based on this analysis, Ruiz-Primo recommended considering similar tasks for assessing 21st century skills.

Performance Task

The next example, also from CLA, was a 90-minute performance task. This task invites students to pretend that they work for a company and that their boss has asked them to evaluate the pros and cons of purchasing an airplane (called the SwiftAir 235) for the company. The task indicates that concern about this purchase has risen with the report of a recent SwiftAir 235 crash. Students are invited to respond in a real-life manner by writing a memorandum (the response format) to their boss analyzing the pros and cons of alternative solutions, anticipating possible problems and solutions to them, recommending what the company should do, and focusing on evidence to support their opinions and recommendations. The scoring of student performance on this item recognizes and evaluates alternative justifiable solutions to the problem and alternative solution paths.

Ruiz-Primo said the problem is ill defined, lacking clear information on the characteristics of the correct response. Although routine procedures to solve the problem depend on the knowledge that the examinee brings to the situation, many examinees will have some sense of how to approach it, such as to read the information provided. It is a rich problem, since examinees are provided detailed information about the SwiftAir 235, in particular, and airplane accidents, in general. Because some of the provided information is relevant and sound, while some is not, part of the problem involves determining what is relevant. The problem does not appear to require academic exposure to solve it, and it is an individual problem. Because the problem is timed, students must apply self-management skills. Finally, the problem context is job related.

Science Achievement Task

Next, Ruiz-Primo presented a task from the Program for International Student Assessment 2006 science test (see Box 7-2), referred to as the School Milk Study. Both questions are designed to elicit some of the behaviors, responses, and actions defined in the observation (evidence) models for the 21st century science skill to “generate and evaluate scientific evidence and explanations” (National Research Council, 2007a, p. 2).


BOX 7-2

The School Milk Study. In 1930, a large-scale study was carried out in the schools in a region of Scotland. For four months, some students received free milk and some did not. The head teachers in each school chose which of their students received milk. (more...)

She explained that the item is well structured, with one correct response, reducing its complexity. As with the previous examples, the extent to which the problem is routine or nonroutine may vary, depending on how much experience a given student has in identifying scientific problems and using evidence to support explanations. The problem does not require students to carry out different activities or carry out subtasks and is based on processes and procedures learned in school. It is designed for individual response, and it is timed in relation to other items in the larger test. The item has a historical setting in a global context, and it demands no written communication.

She concluded that the item is designed primarily to assess declarative knowledge in the domain of science. Because the item is constrained, requiring selection of a correct response, it does not require the student to display complex written communication skills. The constraints reinforce the task and cognitive demands placed on students.

A Technology-Rich Science Task

Ruiz-Primo then presented a sample item that was field-tested in the National Assessment of Educational Progress (NAEP) Technology-Based Assessment Project (Bennett et al., 2007). Focusing on physical science, the item includes three problems, asking the student to determine how different payload masses affect the altitude of a balloon. Students are presented with a search scenario requiring them to locate and synthesize information about a scientific helium balloon, and a simulation scenario requiring them to experiment to solve problems. Students can see animated displays after manipulating the mass carried by the balloon and the amount of helium contained in the balloon.

Ruiz-Primo said that the problems are well structured, with a clearly correct answer. The problem appears routine, as most students know how to gather information from the World Wide Web. The simulation scenario is a fixed procedure that seems to require changing the values of the different variables; what is important is which values to select on each trial. The problems are rich, involving several activities and subtasks. Although it involves technical terms, the problem may not require previous learning of a specific procedure. The problem is from a science context, rather than a real-world context, and is to be solved individually within time constraints.

She concluded that this item taps mainly procedural knowledge, as students are asked to carry out procedures to search for information on the World Wide Web or to conduct the simulation. In addition, students do not have a choice about how best to represent the data collected; it is given. The item is constrained, requiring students to select from a set of tools, and these constraints reinforce the task and cognitive demands placed on students. Finally, it requires some written communication.

Measuring Strategic Knowledge

Noting that a review of the TIMSS 1999 Science Booklet 8 test identified no items measuring strategic knowledge (Li, Ruiz-Primo, and Shavelson, 2006), Ruiz-Primo said that computer technology now makes it possible to track the strategies students use when applying information to solve problems. She pointed to the example of the Interactive Multi-Media Exercises (IMMEX; Case, Stevens, and Cooper, 2007; Cooper, Stevens, and Holme, 2006), a system that presents students with complex, real-world science problems to solve in an online environment. The program has been used to assess science learning in K-12, undergraduate, and medical education. It tracks the actions students take to arrive at a solution and uses data mining to group their strategies into types and to identify pathways into specific strategy types. It provides reliable and repeatable measures of students’ problem-solving skills (Case, Stevens, and Cooper, 2007; Cooper, Sandi-Urena, and Stevens, 2008). In addition, because it offers the possibility of collaborating to solve a problem, the system may be used to elicit students’ communication skills, skills in considering others’ opinions, adaptability, and self-management.
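The underlying idea of logging actions and grouping them into strategy types can be sketched as follows. The action labels and classification rules here are invented for illustration; IMMEX itself derives strategy types by data mining over many student performances rather than from hand-written rules:

```python
# Hedged sketch of IMMEX-style strategy tracking: log each action a
# student takes while solving an online problem, then map the action
# sequence to a coarse strategy type.  The categories and thresholds
# below are illustrative inventions, not IMMEX's actual taxonomy.

def classify_strategy(actions):
    """Classify an ordered list of logged actions into a strategy type."""
    opened = sum(1 for a in actions if a.startswith("open:"))  # resources viewed
    tested = sum(1 for a in actions if a.startswith("test:"))  # evidence gathered
    if opened == 0 and tested == 0:
        return "no-search"
    if opened > 5 and tested <= 1:
        return "exhaustive-reading"   # opens everything, tests little
    if tested >= 2 and opened <= 3:
        return "efficient-testing"    # targeted evidence gathering
    return "mixed"

log = ["open:background", "test:sample-1", "test:sample-2", "submit"]
print(classify_strategy(log))  # -> efficient-testing
```

Because the log captures the path to the solution, not just the final answer, the same performance can be scored for correctness and separately characterized by strategy, which is what makes this approach a candidate for measuring strategic knowledge.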


On the basis of her analysis of these items, Ruiz-Primo concluded that similar tasks should be considered for assessing 21st century skills. She offered four recommendations for future research and development of assessments of 21st century skills. First, she said, it is important to more carefully define the 21st century skills of interest. Second, for the purposes of developing large-scale assessments, it is important to identify the most critical skills, as she did when focusing on nonroutine problem solving and complex communication. Third, she said it is important to define the purposes of assessments designed to measure 21st century skills, such as to provide information for school accountability, to evaluate individual student progress, to focus public attention on educational concerns, or to change educational practices by influencing curriculum and instruction. She observed that different purposes require different sources of evidence to evaluate the validity of the assessment. Fourth, she said that computer-based technology can support the development, administration, and scoring of large-scale assessments of 21st century skills.

Discussion

Session moderator Marcia Linn thanked Ruiz-Primo and Houston, observing that they had posed important questions about how to define the construct of 21st century skills, as well as how to measure this construct. She observed that Houston had demonstrated the importance of the goal of assessing 21st century skills by showing how much money private firms are willing to invest in assessments of these skills, as well as the cost savings that result from the use of these assessments.

She suggested research to develop work samples representing students’ ability to apply scientific knowledge to every aspect of their lives. Noting that the presenters had suggested using technology to assess 21st century skills, Linn said that technology offers opportunities for synergy between the curriculum and the assessment. For example, she said, in her research team’s online instruction, many of the student activities could be used as indicators of their 21st century skills. Linn suggested that teachers could score new types of science assessments capable of measuring 21st century skills. She said that, in the Netherlands, where complex assessments are used, schools send the completed assessments to other schools for grading. This process offers learning opportunities for teachers as well as students.

Linn then invited the audience to write down their reflections and questions about the session in their carbonless notebooks. After several minutes, she invited the audience to pose questions about the session and also to recommend policies or programs to support development of 21st century skills in science education.

The speaker and other workshop participants offered the following responses:

  • Some states have already adopted educational standards incorporating 21st century skills and are beginning to develop assessments aligned with these new standards.
  • Educational assessments are standardized for all students, in contrast to the process in the corporate world, which involves tailoring each assessment to a particular workplace or job.
  • Corporate assessment must be tailored in order to be most relevant for the job and to be the most valid predictor of future job performance.
  • In education, if the purpose is large-scale assessment, the same standardized test should be administered to all students. However, if a teacher wants to know how well her students have learned following a unit of instruction focused on 21st century skills, it may be appropriate to create a unique assessment.
  • Although educational assessments measure individual skills, the value of a skill such as adaptability may be realized in groups, rather than as an attribute of separate individuals. From this perspective, it is important to think about assessment of people in groups.
  • Online assessments can be manipulated to engage students in solving a problem with others who are not physically present. This approach could be used to assess a dimension of self-management—the ability to work as a member of a virtual team (Houston, 2007). Assessments can also be manipulated to change the status of the test-taker in order to assess adaptability. For example, a student may initially be asked to solve the problem individually, and then be told to collaborate with other students. This change of status could be used to assess adaptability, collaboration, and complex communication skills.
  • In corporate assessment, the goal is for each individual to possess adaptability and other 21st century skills, as well as for groups to have these skills. Job performance tests of adaptability are sometimes used to identify those individuals who may be best able to cope with, and adapt to, physically dangerous situations.
Copyright © 2010, National Academy of Sciences.
Bookshelf ID: NBK32690

