Guiding the Design of Evaluations of Innovations in Health Informatics: a Framework and a Case Study of the SMArt SHARP Evaluation
Abstract
Development of health information systems innovations is necessary to create a better future for health and health care, but evaluating such innovations is challenging. This paper examines the problem of evaluating health IT projects in which innovation is agile, adaptive, and emergent, and in which innovation diffusion and production are interlinked. We introduce a typology of mindsets for evaluation design typically used in health informatics: optimality, contingency, and usefulness, and we make the case for a fourth, modularity mindset. We propose a model that shifts the unit of analysis from an evaluation as a whole to specific modules of an evaluation, such as purpose, target, and methods. We then use retrospective participant observation to illustrate the approach with a case study: the ONC SHARP Harvard project developing the SMArt platform (smartplatforms.org). We find that the proposed modular approach to evaluation design provides a balanced alternative to standard archetypical designs on the one hand and fully custom-made designs on the other.
Introduction
With the emergence of heavily funded information systems innovation projects, there is a legitimate interest in evaluating such projects. Evaluation in health informatics is challenging because it lies at the intersection of three notoriously complex areas: health care, information systems, and evaluation methodology (Friedman & Wyatt, 2006). In addition, developing and evaluating innovations is different from improving and testing existing programs. Improvement affects attainment of target outcomes. Innovation affects both attainment of outcomes and the very conceptualization of outcomes. How can we evaluate an innovative program when we do not know what trajectory and outcomes to expect? Evaluating the development of innovations presents challenges that defy the methods that work well for evaluating interventions, improvements, and stable technologies (Patton, 2011). Evaluating the development of health information system innovations therefore presents an additional set of challenges because it lies at the intersection of four, not three, complex areas: health care, information systems, evaluation methodology, and innovation. A number of those challenges are well documented (Greenhalgh, 2010a).
Development and evaluation of health information systems innovations is necessary to create a better future for health and health care. Incremental improvement and continuous improvement are important but can only go so far. Breakthrough change and innovation are necessary, and increasingly sought after. The health care sector in the US is witnessing the emergence of large-scale projects aiming at pioneering system innovations. Public and private agencies are funding and organizing efforts to create breakthrough change in the way health and health care are managed. Prominent among those efforts is the federally funded impetus to use incentives and penalties to accelerate the adoption and meaningful use of electronic health records by health care providers. The Office of the National Coordinator for Health IT (ONC) is also sponsoring five Strategic Health IT Advanced Research Projects (SHARP) under the umbrella of the Health Information Technology for Economic and Clinical Health Act (HITECH). Another example is the Pioneer Portfolio of the Robert Wood Johnson Foundation (RWJF), which funds projects with the specific intent of spurring innovation, such as Project HealthDesign (PHD), a project aimed at “rethinking the potential of Personal Health Records”. Ultimately, the intent of such efforts is to transform the health care sector via breakthrough change that complements incremental and continuous improvement.
Given the amount of social, financial, and political capital invested in health information systems innovation efforts, there is interest in credibly evaluating them. For example, RWJF has assigned an evaluator to Project HealthDesign, and the project team has also conducted internal evaluation activities for its sub-project grantees. Similarly, the SMArt project, one of the five ONC SHARP projects, is undergoing no fewer than three evaluations: an external one by the ONC, another external one commissioned by the project team, and an internal one assigned by the project team. The authors of this paper are members of the second of these, the external evaluation commissioned by the project team, and are tasked with conducting a formative and summative evaluation of the project.
In light of the necessity and difficulty of evaluating health information systems innovations, evaluators are faced with the thorny challenge of designing evaluation studies that have sufficient external validity (for scalability) while also properly meeting the needs of their clients (a complex task when evaluating innovation). The purpose of this paper is to build a foundation to inform evaluation design decisions. We begin by formulating a conceptual framework synthesizing important aspects of the problem, drawing on insights about scalability from industrial design and about complexity from complexity science. We then propose a model to guide the selection or design of an evaluation approach, and apply it to a case study that was not used in the development of the model. We conclude with a presentation and analysis of the results.
Background
As we previously noted, evaluation in health informatics lies at the intersection of three areas, each notorious for its complexity: health care, informatics, and evaluation (Friedman and Wyatt, 2006). Evaluation of health informatics innovation therefore lies at the intersection of four areas of high complexity: health care, informatics, evaluation, and innovation (Figure 1).

Figure 1. Overlapping realms of evaluation of health informatics innovation. (a) Friedman and Wyatt, 1997 & 2006; (b) Greenhalgh et al., 2004; (c) Greenhalgh et al., 2010b; (d) Patton, 2011.
Innovation defies many existing evaluation approaches. Many evaluation approaches make assumptions that are inadequate in situations of high complexity and uncertainty. First, it is assumed that the program to be evaluated has an intended target that is stable, knowable, and known. Second, it is assumed that the program itself is stable, or at least ought to be stabilized through formative evaluation, to prepare it for summative evaluation. As we will see, these assumptions rarely hold in the case of innovation evaluation. The second assumption may hold in part: there is often an intent to stabilize the programs. However, it is doubtful that the programs can be stabilized, given their dynamic nature and the volatility of their environments. It is also possible that innovation would be stifled if structure, constraints, and stability were forced on the programs prematurely (Perrin, 2002). Patton (2011) argues convincingly that while there are contexts where formative and summative evaluations are aligned with the situation and the purpose of the evaluation, there are contexts in which a different approach is needed, one he calls developmental evaluation.
According to Patton (2011), formative-summative is a traditionally dominant distinction in the field of evaluation: a formative evaluation is meant to improve a program model, stabilize it, standardize it, and prepare it for a summative evaluation, which is meant to test, prove, and validate a fixed program model. Both formative evaluation and summative evaluation thus assume that the purpose of evaluation is to test a model. Patton argues that innovation projects do not adhere to the stages of model improvement and model testing, and revolve instead around model development. Accordingly, he asserts that such projects require a developmental evaluation approach.
Phenomena that may require developmental evaluation or other new approaches have been noted in many literatures. Labels that have been used to describe them include “mode-2 strategies” (Regeer et al., 2009), “wicked problems”, “complex systems”, “complex adaptive systems”, “innovation”, and “system innovation”. There is a growing interest in, and impetus to study, such phenomena. This may be due to the increasing effectiveness with which we are solving “simpler” problems and understanding “simpler” systems. Or, it could be due to the world actually becoming more complex. Or, finally, it might be because of changes in the way we work together to solve problems and understand phenomena: our organizations themselves, as well as our collective action patterns, may be becoming more complex. As a result, we can expect to see an increase in the prevalence of situations where traditional modes of evaluation fail. Patton proposes developmental evaluation to begin to fill that gap. This paper contributes to this conversation by examining the challenges of designing evaluations of health informatics innovations and by building a foundation for the development of methods for the design of evaluation, to complement the panoply of existing methods for data collection and analysis (also known as evaluation methods). This distinction between evaluation design methods and evaluation methods is important, as will become apparent in the course of our discussion.
Methods: 1) Model Development
How can evaluators make choices about how to design evaluation studies? There are at least three possible mindsets that inform design decisions. (1) Some evaluation designs are better than others, and the evaluator must always select the best design possible. We can call this mindset the optimality approach to evaluation design. (2) There is no absolute ranking of designs. The superiority or appropriateness of evaluation designs depends on the context, and the evaluator must select the evaluation design that best fits the context of what they are evaluating (the evaluand). We can call this mindset the contingency1 approach to evaluation design. (3) There is no absolute fit or mapping of designs to contexts. A broad range of evaluation design options are likely to lead to studies that provide potentially useful information. The evaluator must take responsibility for their design decisions and understand what can be learned from the results of the methods they chose. We can call this third mindset the usefulness approach to evaluation design.
In other words, the optimality mindset holds that there is some design that is the best for all evaluation purposes and in all contexts. Some statements from proponents of Randomized Controlled Trials (RCT) seem to come from that mindset, implying for example the superiority of RCTs (Liu & Wyatt, 2011). A less restrictive version of the optimality mindset, which we can call the weak optimality mindset, concedes that inferior designs may have to be used, but only when superior designs are infeasible or too expensive. The contingency mindset asserts that the design of the evaluation must match the purpose and context of the evaluation. For example, “JAMIA agrees that RCTs are the preferred standard for evaluation in general, but it is not appropriate in all circumstances.” (Lehmann & Ohno-Machado, 2011) The same editorial from the Journal of the American Medical Informatics Association (JAMIA) also states that, “regardless of design, justification of the match between purpose and design is paramount.” JAMIA clearly rejects the strong optimality mindset, and appears to follow a contingency mindset. A stronger version of that mindset restricts choices to a “correct” or “right” match. In a sense, that strong contingency mindset is no more than a context-dependent version of the optimality mindset.

The usefulness mindset, by contrast, is the least restrictive of the three, and does not assume the existence of an overall best design or of a best fit. It does not even require the notion of a fit, match, or “correct” choice of design. It merely states that (a) design choices are inevitably made, (b) the choices made create affordances and constraints, and (c) they provide potentially useful results. The emphasis is not on selecting the best study design or the right study design for the situation, but rather on being intentional when designing the study that will be used, and on understanding and being able to justify the design decisions. The emphasis thus shifts from selection to design: from selecting a template study design from a catalog of archetypical designs and then modifying it to fit the situation, to designing a tailored study for the situation.

As the examples of weak optimality and strong contingency already show, each evaluation design mindset can be more or less restrictive. We call a more restrictive version of a mindset strong, and a less restrictive version weak. The six combinations are described in Table 1: strong and weak optimality, strong and weak contingency, and strong and weak usefulness.
Table 1.
Typology of evaluation design mindsets
| | Strong | Weak |
|---|---|---|
| Optimality | The best way | The best way feasible |
| Contingency | The right way for the situation | A way that fits the situation |
| Usefulness | Getting the most useful results from the chosen way | Getting useful results from the chosen way |
The typology of evaluation design mindsets that we just constructed (Table 1) is neither unique nor complete (in fact, we will introduce a fourth mindset shortly), but it gives us a conceptual starting point and a useful vocabulary for the discussion of evaluation design for health information systems innovations. First, however, a few words of caution are in order.
It should be clear that none of these mindsets is somehow superior in an absolute sense. In their classic textbook on evaluation in health informatics, Friedman and Wyatt (2006, p 100) state: “gold standards, even if unattainable, are worth approximating. That is, “tarnished” or “fuzzy” standards are better than no standards at all.” They temper that statement by acknowledging that “perfect gold standards do not exist in biomedical informatics or in any other domain of empirical research”, but conclude the paragraph by saying that “studies comparing the performance of information resources against imperfect standards, so long as the degree of imperfection has been estimated, represent a stronger approach than studies that bypass the issue of a standard altogether.” The implication is clear: comparing performance to some standard is superior to not comparing to a standard, largely regardless of context. Elsewhere, Friedman and Wyatt discuss aspects of evaluation where their position is more consistent with a usefulness mindset: “We are suggesting that a mode of study from which something important can be learned is a mode of study worth pursuing. In evaluation, the methods should follow from the questions and the context in which the evaluation is set. It is wrong to give any particular method of study higher status than the problem under study.” Finally, they dedicate a whole subsection to the discussion of matching aspects of the evaluation to aspects of the evaluand (the program, process, or IT innovation that is evaluated), consistent with a contingency mindset: (1) matching what is evaluated to decisions that need to be made; (2) choosing the level of evaluation; (3) matching what is evaluated to the type of information resource; (4) matching how much is evaluated to the stage in the life cycle. Though their discussion covers only a small set of aspects, it shows the importance of examining different facets of an evaluation. We will be building on that insight.
We have pointed out that evaluation design mindset is a factor that affects how evaluation design decisions are made, and we have listed three evaluation design mindsets: optimality (selecting the best design), contingency (selecting the right design for the situation), and usefulness (justifying the usefulness of the design chosen or developed). We have then shown that all three mindsets are used in health informatics evaluation. We trust that the reader can appreciate the importance of being aware of these possible mindsets when appraising an evaluation design or contributing to designing an evaluation. In addition to asking which design options are (or should be) chosen, the reader will now also ask why those options are chosen.
An optimality mindset may be useful when stakes are high. For example, Liu and Wyatt (2011) argue the case for the Randomized Controlled Trial as a standard for evaluating clinical information systems by asserting that clinical information systems are expected to influence patient outcomes, and when patient outcomes are at stake, the RCT is the gold standard. Having established the necessity for a rigorous standard given the stakes, and having asserted that the RCT is the most rigorous standard, they are left with the task of arguing that the practical and ethical challenges to using the RCT in health informatics are surmountable.
A contingency mindset, on the other hand, may be useful when the situation internal and external to the evaluand is stable and well known. For example, numerous evaluation studies have been conducted on the effect of barcoded medication administration (BCMA) on clinician workflows in hospitals; a new evaluation of the same topic would be able to draw on evaluation reports from the literature, examining what approaches were used and with what consequences. In such a situation, it would be unnecessary and perhaps inefficient to design an evaluation from scratch. It would be useful instead to develop a design that is contingent on the context (hospitals), the technology (BCMA), and the evaluation target outcome (BCMA effect on workflows), and that builds on previous evaluations with a comparable context, technology, and target.
Finally, a usefulness mindset will be useful in a broad range of situations, including the ones described above, since its assumptions are less restrictive than the optimality and contingency mindsets, and do not preclude following those mindsets if needed.
Note then that the optimality mindset is in a way a special case of the contingency mindset, which is in a way a special case of the usefulness mindset2. This statement has simple but profound implications for the way evaluation design decisions are made. (1) If one knows exactly what they are doing and can justify it, then the optimality mindset may be the most appropriate. Namely, select the (known) best evaluation design. (2) If one is not as confident but the evaluation problem they are facing is stable, well defined, and well studied in a context that is stable and well understood, then the contingency mindset may be the most appropriate. Namely, select one of the established evaluation designs that have been successfully demonstrated for this problem and context. (3) If none of these conditions hold, as is typically the case with innovation, then the usefulness mindset would provide the most possibilities and may therefore be the most appropriate. We will shortly propose a model to assist evaluation design that adds some useful structure to the usefulness mindset. First, a brief summary would be helpful.
Due to the challenges and complexity of evaluation in health informatics, there are a large number of evaluation approaches and methods, and each evaluation study needs to be tailored to the project and context (Friedman & Wyatt, 2006). Evaluators facing such a large set of design decisions can easily be overwhelmed. We trust that the reader appreciates by now that there are different ways to think about those decisions, that they are all likely to be useful in some way, and that they are all already used in health informatics evaluation, though maybe not deliberately or knowingly. We have pointed out and defined three of them: the optimality, contingency, and usefulness mindsets to evaluation design. We have argued that while those three mindsets are not rank-ordered with respect to superiority, they can (approximately) be rank-ordered with respect to generality. The optimality mindset is a special case of the contingency mindset, with a very loosely defined concept of context, and the contingency mindset is a special case of the usefulness mindset, with a very tightly defined concept of context. Having fleshed out the three evaluation design mindsets in the broad realm of health informatics evaluation, we can now turn our attention to the subset of that realm that is concerned with evaluating health information systems innovations.
Innovations differ from fixed programs in that they resist standardization and rigidity. It turns out that the optimality mindset is consistent with a tendency toward standardization, and the usefulness mindset is consistent with a tendency toward customization. In fact, as suggested above, the context is specified very openly in the optimality mindset and very narrowly in the usefulness mindset. The contingency mindset, especially the strong contingency mindset, can also be consistent with a tendency toward standardization, at least within certain classes of contexts. Note that there is a subtle but important nuance in the way the phrase “evaluation design” can be used. Evaluation design can refer to the archetypical design selected for a study (e.g., an experimental design, a quasi-experimental design, a naturalistic design, etc.), or it can refer to the process of designing a study (that may or may not be archetypical). The phrase “a design” refers to a set of decisions made. The phrase “design” refers to the act of making those decisions. This distinction becomes especially important in the discussion of innovation evaluation, where there might not yet exist an archetypical design that can be used in a particular situation, and evaluation may need to be custom-designed.
We have seen that the optimality and contingency evaluation design mindsets require unnecessarily restrictive assumptions that rarely hold in practice. However, in many situations those restrictions serve useful purposes as frameworks and scaffolds that provide necessary structure for the streamlined execution of evaluation studies and the comparison of results across evaluation studies. The optimality and contingency mindsets reduce the degrees of freedom afforded by the usefulness mindset, narrowing the design space to manageable dimensions. In a sense, they act as helpful design heuristics. It would be inefficient and unreasonable to reinvent study designs for routine, well-structured projects in the presence of standard general and context-specific design archetypes. Custom design of tailored, usefulness-based evaluation studies has many advantages but is often costly, and sometimes inefficient. Unfortunately, evaluation design archetypes may be infeasible or undesirable in certain situations, such as the evaluation of innovations, which exhibit uncertainty, idiosyncrasy, dynamic trajectories, and contextual interdependence.
The tension between customization and standardization is well known from an industrial design perspective. Insights from the design of clothes can be useful for the design of evaluation studies. At one extreme, every single shirt can be made to order, custom-designed to fit the specific body characteristics, needs, and wants of a specific customer at a specific point in time. While that approach gives customers the features they want with very high precision, it is extremely costly and non-scalable. The equivalent in evaluation design is not having design archetypes at all, and improvising (often reinventing) study designs on a case-by-case basis. At the other extreme, all shirts can be the exact same size, shape, and color. The design and manufacturing becomes much cheaper and much more scalable. However, shirts are then too loose or too tight on a lot of people, not to mention boring and psychologically disturbing. The equivalent in evaluation design is having a standard structure for how an evaluation study should be, and systematically using it to evaluate any project.
In the real world, of course, we have a hybrid of both approaches. Historically, we had customization only, when standardization was infeasible. Now we have a combination of customization and standardization, depending on the level of analysis (see the complexity profile figure, adapted from Bar-Yam, 1997): every shirt has dimensions below 3 meters. Every tank top has no sleeves. Every T-shirt has short sleeves. Every dress shirt has long sleeves. And so on. Note the relation of the object “sleeve” to the object “shirt”. A shirt is a modular object. It consists of a number of modules, including sleeves, a collar, and a torso. Also note the notions of short sleeves, long sleeves, and no sleeves. The module sleeve of the object shirt has the attribute sleeve length. It also has other attributes, including color, width, and fabric. Similarly, the other modules of a shirt have attributes, and some of them may have different types of attributes. For example, the collar module has a style attribute that takes a number of values including standing, turnover, and flat. It is clear that an infinite number of possible shirt designs can be created through different combinations of modular attributes, without having to have a single standard design, a finite number of standard models (i.e., archetypes, templates), or custom-made tailored designs. Also note that for each object there are different ways it can be partitioned into a set of modules. A shirt could be partitioned into “front” and “back”, as T-shirt printing companies often do, instead of “sleeves”, “collar”, and “torso”.
Similarly, consider the set of all possible evaluation designs. They are clearly not all the same, so we do not have pure standardization. However, they are not all different in all respects either: they share common characteristics. For example, all evaluations have a purpose and stakeholders. These common characteristics can be used to introduce modularity into the systematic design of evaluations. We have seen that pure standardization is too restrictive as a guide for innovation evaluation design in health informatics. On the other hand, pure customization can be inefficient, can duplicate effort, and cannot be easily reproduced. We have pointed out that restrictions can serve useful purposes and act as frameworks and scaffolds to guide design. A hybrid approach can strike a balance between pure standardization and pure customization. The earlier discussion suggests that one hybrid approach currently used in health informatics evaluation is related to the contingency mindset. We now propose a fourth evaluation design mindset, which we can call the modularity mindset, that enables another kind of compromise between pure standardization of evaluation designs and pure customization of evaluation designs. The contingency mindset compromises by partitioning the set of possible contexts into a set of classes of contexts, then matching classes of contexts to standard classes (archetypes) of evaluation designs. The modularity mindset, on the other hand, partitions the set of possible evaluation designs into a set of modules, each associated with a set of design decisions. The modularity mindset thus retains some of the standardizability of the optimality mindset, and some of the customizability of the usefulness mindset.
The implications of each of the four different mindsets for evaluation design are important. A mathematical formalization of these simple but profound implications is possible, using Set Theory and Category Theory, but we do not need it for our present purposes. We can use the shirt example again to illustrate the differences between all four mindsets, and the implications of each on the structure and scalability of evaluation designs, on the one hand, and on their variety and complexity, on the other hand. Innovations are uncertain, dynamic, and unbounded, have an unbounded set of possibilities, and require designs that are able to cope with that variety and complexity. Programs, on the other hand, are structured, typically have a finite, bounded set of possibilities, and can be evaluated with structured standard designs.
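For readers who want a flavor of what such a formalization might look like, the following minimal set-theoretic sketch captures the nesting claim made earlier. The symbols D, C, d*, ~, and U are shorthand introduced only for this illustration, not established notation:

```latex
% Let D be the set of all possible evaluation designs and C the set of possible contexts.
% Read each mindset as a family of design rules f : C -> 2^D, mapping a context to the
% designs it admits.
\begin{align*}
\text{Optimality:}  \quad & f(c) = \{\, d^{*} \,\}
    && \text{a single best design, for every } c \in \mathcal{C} \\
\text{Contingency:} \quad & f(c) = \{\, d^{*}_{[c]} \,\}
    && \text{one archetype per context class } [c] \in \mathcal{C}/{\sim} \\
\text{Usefulness:}  \quad & f(c) = \{\, d \in \mathcal{D} : U(d, c) \,\}
    && \text{any design whose results are defensibly useful in context } c
\end{align*}
% Optimality rules are contingency rules with the trivial partition (a single context class),
% and contingency rules are usefulness rules with U(d, c) defined as "d is the archetype
% assigned to [c]". The family of optimality rules is therefore contained in the family of
% contingency rules, which is contained in the family of usefulness rules.
```

In this reading, the modularity mindset introduced above sits between the contingency and usefulness rules: it fixes the partition of a design into modules and attributes, but leaves the choice within each attribute open.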
To keep the discussion short, let us focus on one attribute of shirts: their size. The optimality mindset would produce a one-size-fits-all design. The result is easy manufacturability, at the cost of some disgruntled outliers. The contingency mindset introduces fixed size classes such as XS, S, M, L, and XL. Some of the outliers would now be satisfied, but there would remain some outliers whose bodies have different proportions (e.g., ratio of arm length to torso length) that are not accounted for in the one-ratio-fits-all size classes. The modularity mindset introduces notions like sleeve length, waist width, and torso length, allowing for a large number of combinations. At the other extreme, the usefulness mindset would be open to all possibilities, and sizes could be made to order and custom-designed for individuals.
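The contrast can also be made concrete in code. The toy sketch below is purely illustrative; the attribute names and option lists are invented for the example. It counts how many distinct shirt designs each mindset makes available:

```python
from itertools import product

# Illustrative attribute options for three shirt modules (invented for this example).
SLEEVE_LENGTH = ["none", "short", "long"]
TORSO_LENGTH = ["26in", "28in", "30in", "32in"]
WAIST_WIDTH = ["38in", "42in", "46in", "50in"]

# Optimality: one design fits all; the design space is a single point.
optimality_designs = [("short", "30in", "46in")]

# Contingency: a fixed catalog of archetypes (size classes), one per context class.
contingency_designs = {
    "XS": ("short", "26in", "38in"),
    "S":  ("short", "28in", "42in"),
    "M":  ("short", "30in", "46in"),
    "L":  ("long",  "32in", "50in"),
}

# Modularity: any combination of per-module attribute choices is an admissible design.
modularity_designs = list(product(SLEEVE_LENGTH, TORSO_LENGTH, WAIST_WIDTH))

print(len(optimality_designs))   # 1 design
print(len(contingency_designs))  # 4 archetypes
print(len(modularity_designs))   # 3 * 4 * 4 = 48 combinations
# Usefulness: the space is open-ended (made to order), so it cannot be enumerated here.
```

The same arithmetic applies to evaluation design: a handful of standardized choices per module yields a combinatorially large, yet still structured, space of possible evaluation studies.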
We now turn to the practical matter of presenting one possible model that is consistent with the modularity mindset. The purpose of the model is to assist evaluators in designing useful and appropriate evaluation studies. The model is adapted from Patton (Table 2) and consists of 26 attributes that characterize evaluations, organized into 7 sections (modules). Patton actually uses those attributes for a different purpose: to compare and contrast what he calls traditional evaluation with what he calls developmental evaluation. His comparison suggests that evaluations can fall under either one or the other of the categories, and it is implied that traditional evaluation is better for some project situations and that developmental evaluation is better for others. In that sense, he seems to be more aligned with a contingency mindset than with a usefulness mindset. His implicit premise is that the evaluation approach must match the project situation and context. That is, given the project, the right approach is either traditional evaluation or developmental evaluation. This premise is tested and refuted in our case study discussion. The rest of Patton’s book does eventually relax the restrictions imposed in his comparison between traditional and developmental evaluation and becomes more consistent with a usefulness mindset. Though his comparison summary table (Table 2) may be oversimplified and potentially misleading, the list of attributes he uses to structure the comparison is consistent with Friedman & Wyatt’s (2006) discussion, and provides an excellent starting point for thinking about evaluation design in a more textured and possibly more modular way. Patton’s list of what we can call evaluation design attributes is consistent with, and goes beyond, the issues mentioned in Friedman and Wyatt’s brief discussion of “what and how much to study”. We use Patton’s list (Table 2) as a structure for the proposed model. This is appropriate since his list is consistent with the demands of health informatics evaluation and is more expansive than Friedman and Wyatt’s discussion.
Table 2.
Evaluation design modules and attributes
From Patton, 2011, pp. 23–26, Exhibit 1.2; cell content not reproduced.
| Module | Attribute | (A) Traditional program evaluation tendencies | (B) Complexity-sensitive developmental evaluation |
|---|---|---|---|
| 1. Purpose and situation | 1.1 Evaluation purposes | | |
| | 1.2 Situation where it is appropriate | | |
| | 1.3 Dominant niche and mindset | | |
| 2. Focus and target of evaluation | 2.1 Target of change | | |
| | 2.2 Driving force of the intervention | | |
| | 2.3 Evaluation results focus | | |
| | 2.4 Evaluation locus | | |
| 3. Modeling and methods | 3.1 Modeling approach | | |
| | 3.2 Counterfactuals | | |
| | 3.3 Measurement approach | | |
| | 3.4 Attention to unexpected consequences | | |
| | 3.5 Evaluation design responsibility | | |
| | 3.6 Methods approach and philosophy | | |
| | 3.7 Interpretation and reasoning processes | | |
| 4. Roles and relationships | 4.1 Ideal evaluator stance | | |
| | 4.2 Locus and focus of accountability | | |
| | 4.3 Organizational locus of evaluation | | |
| 5. Evaluation results and impacts | 5.1 Desired and ideal evaluation findings | | |
| | 5.2 Evaluation approach to model dissemination | | |
| | 5.3 Reporting mode | | |
| | 5.4 Impact of evaluation on organizational culture | | |
| | 5.5 Evaluation capacity built through the evaluation process | | |
| 6. Approaches to complexity | 6.1 Approach to uncertainty | | |
| | 6.2 Approach to control | | |
| 7. Professional qualities | 7.1 Key evaluator attributes | | |
| | 7.2 Evaluation standards and ethics | | |
Using the modularity mindset, we do not necessarily have to impose the restrictions Patton imposes with the last two columns. Instead, we could discuss each module and attribute separately, allowing for all possibilities within each module-attribute pair. The only structure we would introduce is the partition into modules and attributes. However, for the purpose of illustration and due to space constraints, we will limit our scope to the binary design choice presented by Patton: (A) traditional program evaluation or (B) complexity-sensitive developmental evaluation? We examine whether it is feasible to answer that question about the entire project as a single unit of analysis, as Patton seems to suggest, and show a case study where it is not. We then ask whether it can be feasible and useful to classify each module attribute of a project and its evaluation as either A or B (modularity mindset), rather than classifying the whole project as either A or B, as sketched below. Future work can attempt to generalize the resulting findings to non-binary design choices and to design questions not based on choosing between finite options.
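As a hypothetical illustration of that module-level unit of analysis (not a tool used in the study; the attribute names follow Table 2, and the codes are simplified to “A”, “B”, or None for “not yet determined”), the classification could be recorded as a per-attribute profile rather than a single verdict:

```python
from typing import Dict, Optional

# Simplified per-attribute classification: "A" (traditional program evaluation),
# "B" (complexity-sensitive developmental evaluation), or None (not yet determined).
Classification = Optional[str]

# Hypothetical partial record for one evaluand; attribute names follow Table 2.
classifications: Dict[str, Classification] = {
    "1.1 Evaluation purposes": "A",
    "1.2 Situation where it is appropriate": "B",
    "1.3 Dominant niche and mindset": "B",
    "2.1 Target of change": None,  # undecided pending further information
}

def summarize(records: Dict[str, Classification]) -> Dict[str, int]:
    """Count attributes per code instead of forcing one label on the whole evaluation."""
    counts = {"A": 0, "B": 0, "undetermined": 0}
    for code in records.values():
        counts[code if code in ("A", "B") else "undetermined"] += 1
    return counts

print(summarize(classifications))  # {'A': 1, 'B': 2, 'undetermined': 1}
```

In this form, the question “is the evaluation traditional or developmental?” is replaced by a profile over module attributes, which is the shift the case study below is meant to illustrate.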
Methods: 2) Model Testing
Design
We use an instrumental case study to provide insight into the problem of evaluating innovations in health informatics. Stake (1994) identified three types of case studies: intrinsic, instrumental, and collective. In an intrinsic case study, the aim is to understand a specific case because the case itself is of particular interest, not because it represents some broader class of cases. For our purposes, we are more interested in the problem of evaluating health IT innovations than in the specifics of the case. Accordingly, we do not present an in-depth examination of the case in this paper. In instrumental case studies, by contrast, the aim is to provide insight into an issue or problem, or to refine a theory; understanding the complexities of the case is secondary to understanding something else.
The case report presented here is based on a retrospective participant observation that had the following characteristics on Patton’s (1980) five dimensions of variations in approaches to observation.
- Role of the observer: as members of the evaluation team, the authors played the role of full participant observer.
- Portrayal of the observer’s role to others: since the authors were participants, the other members of the evaluation team were aware of their presence, and their observations were overt, albeit unmentioned.
- Portrayal of the purpose of the observation to others: given the retrospective nature of this analysis, the purpose of the observations was not explained to the non-observer participants.
- Duration of the observations: the authors have been (and still are) participant observers of the SMArt evaluation since its inception, over 2 years ago.
- Focus of the observations: a holistic view of the entire setting and its elements was gathered. A guiding structure (Table 2) was then used to analyze particular aspects of the case.
Retrospective participant observation has the advantage of sidestepping the observer’s paradox (Labov, 1972). That phenomenon, also known as the experimenter effect or the Hawthorne effect, is best illustrated by this quote from Labov: “the aim of linguistic research in the community must be to find out how people talk when they are not being systematically observed; yet we can only obtain this data by systematic observation.” A subtly different threat that we are also able to avoid is the observer-expectancy effect, in which the behavior of the observed subjects is influenced not simply because they are conscious of being observed, but because the observer him/herself unconsciously behaves in a way that influences them. Both the observer’s paradox and the observer-expectancy effect are addressed by the retrospective nature of the observation in this study.
Research questions
(1) To what extent is it feasible and useful to use the proposed model to highlight specific challenges to designing an evaluation for this case? (2) What are the barriers and facilitators to the usability and usefulness of the model in this case? (3) Testing Patton’s implicit premise: Do all evaluation design attributes align (together) with either traditional or developmental evaluation? (4) Are there any evaluation design attributes that do not align with either traditional or developmental evaluation?
Case description
The SMArt project is fully titled “Substitutable Medical Apps, reusable technologies”. It is one of five Strategic Health IT Advanced Research Projects (SHARP) sponsored under the umbrella of the Health Information Technology for Economic and Clinical Health Act (HITECH) by the Office of the National Coordinator for Health IT (ONC). The purpose of the project is to investigate, evaluate, and prototype approaches to achieving a generative IT architecture that supports substitutability of medical apps. The authors of this paper are members of the SMArt external evaluation team.
Results
We present the classification of the attributes of the first 2 modules, followed by a table summarizing the classification results for all 7 modules. We classify each module attribute as A if it is consistent with Patton’s “traditional program evaluation tendencies”, B if it is consistent with Patton’s “complexity-sensitive developmental evaluation”, O for “other”, X if we cannot decide because we lack sufficient information, and U for “undecided” if we cannot decide for some other reason. During the classification process, it became apparent that in some cases the classification differed depending on whether it was applied to the project or to the evaluation. We therefore make that distinction where needed.
- 1. Purpose and situation
- 1.1 Evaluation purposes—The purpose of the evaluation team, as outlined in the SMArt proposal to the ONC, is to conduct “formative and summative evaluation activities to document the achievements of the SMArt initiative and appraise the impact of SMArt on technical capacity, clinical practice, and patient outcomes”. In practice, although the evaluation purpose is consistent with classic standards of evaluation practice, the team faced challenges when attempting to put it into action, including the ambiguity and elusiveness of the project plan and changes in the project’s external situation, such as the ONC asking the project team to accelerate the project timeline by 6 months while keeping the same objectives and budget. We classify this attribute as A for the project and A(stated) and B(actual) for the evaluation.
- 1.2 Situation—First, the section of the SMArt proposal that described the evaluation task for our team consisted of a single paragraph that contained a high-level description of the purpose and broad milestones. Such a large amount of flexibility made it difficult for the evaluation team to develop an evaluation study plan, and for the project team to provide the evaluation team with necessary information. As a matter of fact, at the end of Year 2 the project team and the evaluation team are still inquiring about each other’s scope. The flexibility and vagueness of the specifications of the evaluation task parallel the complex and dynamic nature of the project and the economic, industrial, and policy ecosystem in which it is embedded (Mandl & Kohane, 2008). Since the “key variables expected to affect outcomes are [not] controllable, measurable, and predictable” and there are “multiple pathways possible” (Patton, 2011), we can classify this attribute as B.
- 1.3 Dominant niche and mindset—The SHARP funding mechanism is concerned with accelerating technology implementation rather than innovation and solution exploration. Similarly, the evaluation task required a focus on impact and effectiveness, not on development. However, the SMArt project did not satisfy the ideal conditions of formative evaluation, which require focusing improvement efforts on clear, specific, measurable, attainable, and time-bound target outcomes of a draft program model that needs fine-tuning. As we began planning the formative evaluation activities, we faced the obstacle of not having clear target outcomes, which suggests that the project may still be in too early a phase of development, one in which priorities shift continuously and development occurs serendipitously. After many bi-weekly meetings with the SMArt Executive Director, we proposed five outcomes: “awareness”, “participation”, “acceptance”, “adoption”, and “implementation” (Baran et al., 2011). However, we are still facing difficulties in operationalizing and measuring those broadly defined outcomes. Similarly, it seems possible that the project may not reach the stability required for summative evaluation within the evaluation study period, due to its dynamic and complex nature. To date, any outcomes identified by the project team or the evaluation team have been too vague to measure, too elusive to track, or too prematurely specific. We classify this attribute as A(stated) and B(ideal) for the project and A(stated) and B(actual) for the evaluation.
- 2. Focus and target of evaluation
- 2.1 Target of change—Defining the units of analysis of the evaluation has been one of the major problems in designing an evaluation approach. Initially, three broad stakeholder groups were defined: vendors (e.g., Epic, Cerner, and other EHR vendors that would SMArt-enable their systems), developers (individuals or organizations that would develop SMArt apps), and clinical end users. However, it has become apparent that stakeholders’ acceptance of the SMArt ideas and their willingness to adopt them are interlinked with those of other stakeholders, and with industry-wide paradigm shifts and market forces. Evaluating change in the behaviors of specific stakeholders requires identifying which organizations or individuals to study, and identifying specific outcomes to measure. On the other hand, treating SMArt as a disruptive social innovation aimed at cross-scale impacts on big problems, which is reasonable, poses the challenge of how to evaluate its interim progression toward those impacts. We classify this attribute as A(stated, actual) and B(ideal) for the project and A(stated) and B(actual) for the evaluation.
- 2.2 Driving force of the intervention—SMArt is presented and approached by the project team (Mandl & Kohane, 2008) as systems-change-driven. However, the evaluation task is presented to the evaluation team as outcomes-driven, focusing on awareness and adoption. We classify the attribute as B for the project and A(stated, actual) and B(ideal) for the evaluation.
- 2.3 Evaluation results focus—According to the SMArt project proposal, the evaluation team is tasked with “documenting SMArt integrity and adherence to or deviation from the anticipated vision of SMArt. During years 3 and 4, [the evaluation team] will focus on appraising the baseline function of SMArt as defined by the Boston team, and then will use on-site observations, interviews and review of data to evaluate the impact of SMArt on the ecosystem.” These are distinctly traditional formative and summative evaluation tasks. To date, the evaluation team has not had the opportunity to support action in the development process. We classify this attribute as X for the project (we cannot decide due to lack of information), and A for the evaluation.
- 2.4 Evaluation locus—The evaluation is neither top-down (theory driven) nor bottom-up (participatory) in this case. The evaluators are struggling in “the muddled middle where top-down and bottom-up forces intersect and often collide” (Patton, 2011). According to Patton, developmental evaluators help innovators navigate the muddled middle. We have not had the opportunity to offer such help. Our understanding is that from the perspective of the project team, the external evaluation of this project is top-down (theory driven). This understanding is supported by the use of language such as “This [evaluation approach] has proven to be a highly successful model [citation]” in the written evaluation task. We therefore classify this attribute as A for the project, and B(ideal) and O(actual) for the evaluation. As previously stated, we will summarize the rest of the case study in Table 3.
Table 3.
Case study evaluation design modules and attributes of the external evaluation of SMArt
| Attribute | Project and context: (A) Traditional program evaluation tendencies | Project and context: (B) Complexity-sensitive developmental evaluation | Evaluation task: (A) Traditional program evaluation tendencies | Evaluation task: (B) Complexity-sensitive developmental evaluation |
|---|---|---|---|---|
| 1.1 Evaluation purposes | Yes | No | Stated | Actual |
| 1.2 Situation where it is appropriate | No | Yes | No | Yes |
| 1.3 Dominant niche and mindset | Stated | Ideal | Stated | Actual |
| 2.1 Target of change | Stated & actual | Ideal | Stated | Actual & ideal |
| 2.2 Driving force of the intervention | No | Yes | Stated & actual | Ideal |
| 2.3 Evaluation results focus | Unknown | Unknown | Stated & actual | Undecided |
| 2.4 Evaluation locus | Yes | No | (actual: neither A nor B) | Ideal |
| 3.1 Modeling approach | No | Ideal | No | Ideal |
| 3.2 Counterfactuals | No | Yes | No | Yes |
| 3.3 Measurement approach | No | Ideal | No | Yes |
| 3.4 Attention to unexpected consequences | Somewhat | Somewhat | No | Yes |
| 3.5 Evaluation design responsibility | Yes | No | Stated | Ideal |
| 3.6 Methods approach and philosophy | Yes | No | No | Yes |
| 3.7 Interpretation and reasoning processes | Yes | No | No | Yes |
| 4.1 Ideal evaluator stance | Yes | No | Yes | No |
| 4.2 Locus and focus of accountability | Unknown | Unknown | Somewhat | Somewhat |
| 4.3 Organizational locus of evaluation | Yes | No | Actual | Ideal |
| 5.1 Desired and ideal evaluation findings | No | Yes | No | Yes |
| 5.2 Evaluation approach to model dissemination | Yes | No | Undecided | Undecided |
| 5.3 Reporting mode | Somewhat | Somewhat | Somewhat | Ideal |
| 5.4 Impact of evaluation on organizational culture | Actual | Ideal | Actual | Ideal |
| 5.5 Evaluation capacity built through the evaluation process | No | Yes | Actual | Attempts |
| 6.1 Approach to uncertainty | No | Yes | No | Yes |
| 6.2 Approach to control | Yes | No | No | Yes |
| 7.1 Key evaluator attributes | Yes | Unknown | No | Yes |
| 7.2 Evaluation standards and ethics | Same | Same | Same | Same |
Discussion
We begin with a discussion of the case study guided by the four research questions. (1) The case study shows that the proposed model provides a potentially useful, granular, and structured view of some of the challenges in evaluating health IT innovation in the case of SMArt. As for feasibility, some cells in the table were difficult to fill out, and some determinations were not obvious or clear cut. More importantly, both the resulting table and the process of filling it out require the analyst to ask important questions about specific aspects of the project and the evaluation. We argue that mere guidance to think about important questions is useful. (2) The main barrier to using this model is that there are no clear answers to the questions involved in filling out each of the cells of the table. The model cannot be viewed as a tool for answering design questions; it is rather useful as a collection of checklists for making sure that important design questions are asked, and for being aware of which questions are being asked and how they are being answered. (3) The case study clearly shows that it is not always preferable to classify a whole project as either traditional or complex and adaptive, or evaluation requirements as either formative/summative or developmental. Furthermore, it is not always feasible to use the whole evaluation as the unit of analysis when making design decisions. It is not possible to classify the SMArt evaluation as a whole as either traditional (formative/summative) or developmental. A more textured view may be necessary, as it was in this case, with specific evaluation attributes rather than the whole project as the units of analysis. Having a list of those attributes helps avoid an ad hoc approach to examining the various units of analysis relevant to evaluation design. (4) The table shows that some evaluation design attributes did not fit just one of the two categories, traditional and developmental. This suggests to the analyst (e.g., an evaluator using the table) that different questions may need to be asked about those attributes.
Furthermore, we have found that the terminology used by Patton in naming the items in the list is too restrictive for our purposes. This is unsurprising, since his specific purpose for using the list was to compare and contrast traditional and developmental evaluation. When using our proposed approach to think about evaluation design, evaluators are advised to frame each of the evaluation design attributes more broadly, to add guiding questions for each of the attributes, and to introduce additional attributes that we have not included. The model we presented shows promise, both as a preliminary illustration of thinking about evaluation design modularly and as a foundation for future development and testing of instruments (e.g., worksheets, checklists, or interactive computer forms) that assist evaluators in the design of evaluation studies, especially when the evaluand is innovative, complex, and adaptive, and standard archetypes of evaluation studies are not applicable. Recommended future work includes testing this model and subsequent versions using a larger set of case studies, then considering wider dissemination to other evaluators for field-testing.
Figure. Complexity profile: (a) pure customization, (b) pure standardization, and (c) hybrid. Adapted from Bar-Yam (1997).
Acknowledgments
The authors acknowledge financial support from the SMArt SHARP project, and feedback from members of the Brennan Health Systems lab.
Footnotes
1. In reference to contingency theory, according to which there is no absolute optimal approach to any endeavor, and optimality is contingent (dependent) on the internal and external situation (context).
2. To be precise, slight distinctions need to be made with respect to the weak and strong aspects of the mindsets, but they are not necessary for our present purposes.

