NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Feeny DH, Eckstrom E, Whitlock EP, et al. A Primer for Systematic Reviewers on the Measurement of Functional Status and Health-Related Quality of Life in Older Adults [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013 Sep.

Patient-Reported Outcomes, Health-Related Quality of Life, and Function: An Overview of Measurement Properties

In general, we rely on patient reports to assess HRQL and function. This section provides a discussion of the most important considerations when using evidence derived from the application of such measures. But first it is necessary to clarify terminology and provide definitions for the important relevant concepts and measurement properties.


There is considerable heterogeneity in the terms used to describe HRQL and functional status. Recently the United States Food and Drug Administration introduced the term patient-reported outcomes, PROs.2 A key component of the FDA definition is that the measure conveys information reported by the patient that is not filtered by an observer or clinician. In the United Kingdom the term patient-reported outcome measures, PROMs, is widely used. Some authors use the terms HRQL, health status, PROs, and PROMs interchangeably; we and many others do not. Rather we provide the following definitions.

Health Status: A person's current state of health. Typically that includes functional status, morbidity, physiologic outcomes, and some notion of well-being.3

Functional Status: Starfield defines functional status as “the capacity to engage in activities of daily living and social activities.”4

Frailty: Fried's definition is the presence of at least three of five factors: (1) unintentional weight loss (10 pounds or more in a year), (2) general feeling of exhaustion, (3) weakness (as measured by grip strength), (4) slow walking speed, and (5) low levels of physical activity.5 Frailty is a risk factor for further decline in functional status and mortality, and can be associated with a wide variety of chronic conditions.
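The counting rule in Fried's definition can be illustrated with a minimal sketch. It assumes each of the five criteria has already been assessed as present or absent; the cut-points used to make those assessments (for grip strength or walking speed, for example) are not specified here, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, astuple


@dataclass
class FrailtyCriteria:
    """Presence (True) or absence (False) of each of the five factors."""
    unintentional_weight_loss: bool  # 10 pounds or more in a year
    exhaustion: bool                 # general feeling of exhaustion
    weakness: bool                   # as measured by grip strength
    slow_walking_speed: bool
    low_physical_activity: bool


def count_criteria(criteria: FrailtyCriteria) -> int:
    """Number of the five factors that are present."""
    return sum(astuple(criteria))


def is_frail(criteria: FrailtyCriteria, threshold: int = 3) -> bool:
    """Frail if at least `threshold` (three, per the definition) factors are present."""
    return count_criteria(criteria) >= threshold
```

A person with weight loss, exhaustion, and weakness but normal walking speed and activity would meet the definition; a person with only two factors would not.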

The concepts health status, functional status, and frailty, an important type of functional status for older populations, focus on a description of the current state of health of the subject. As noted below in the definition of HRQL used in this paper, the concept of HRQL includes health status but goes further by including the value attached to that health status.

Health-Related Quality of Life: There are a wide variety of definitions of HRQL. Some focus on the domains of health status that comprise HRQL, usually including physical health, mental health, social and role function, and pain and discomfort. Patrick and Erickson provide a useful definition (1993, p 22).3 “Health-related quality of life is the value assigned to duration of life as modified by the impairments, functional states, perceptions, and social opportunities that are influenced by disease, injury, treatment, or policy.”

Classification of Health-Related Quality of Life Measures

One taxonomy focuses on the types of persons for whom the measure is applicable.6 Generic measures typically include both physical and mental health, are applicable to virtually any adult population, and can be used to make comparisons across diseases and conditions. There are two major categories of generic measures:7 health profiles such as the Short-Form 36 (SF-36)8 and preference-based measures such as the Health Utilities Index.9 Each of these will be discussed in more detail below. Specific measures are applicable to people with a particular disease (e.g., breast cancer), condition (e.g., frailty), or symptom (e.g., pain). Specific measures are often more responsive to change than generic measures10,11 but may not capture the effects of comorbidities, do not allow for comparisons across conditions, and thus have limited usefulness for cost-effectiveness analyses and other broader analyses. Some generic measures have condition-specific adaptations.12

Measures can also be classified by their intended purpose.6,13,14 Evaluative measures capture “within person change” over time. Discriminative measures detect differences among groups (or individuals) at a point in time. Many measures of functional status were designed for this purpose. In practice, however, most measures are used for both purposes. Because systematic reviewers are interested in assessing the effectiveness of interventions, our focus is on evaluative applications of measures and the measurement properties that are important for assessing change over time.

It is also useful to note that there are three major intellectual paradigms upon which most measures of functional status and HRQL are based: psychometric, clinimetric, and economics/decision science.15 The psychometric paradigm draws from psychology and typically relies on a latent-variable model.16 In this paradigm, the underlying construct is not measured directly but rather the items in a measure reflect that construct. The clinimetric tradition builds a measure by selecting items that are important to patients with that condition or problem; this approach is often used to develop specific measures. Finally, the economics and decision science paradigm, like the clinimetric tradition, selects domains and items on the basis of their importance to patients or members of the general population. The economics/decision science paradigm also focuses on the value attached to the health state, typically on a scale in which dead = 0.00 and perfect health = 1.00, thus enabling the integration of morbidity and mortality. There is considerable cross-fertilization among the three paradigms. Examples of measures based on each of these intellectual traditions are described below.

Quality-Adjusted Survival

The goal of interventions is to improve functional status or HRQL outcomes for older adults or reduce the rate of decline in their functional status or HRQL. That is, the goal is to improve quality-adjusted survival.17,18 A unique feature of preference-based measures is their ability to integrate mortality and morbidity and provide estimates of quality-adjusted survival or quality-adjusted life years gained. Preference-based (utility) measures are on a scale in which dead = 0.00 and perfect health = 1.00. Preference or utility scores are derived directly using choice-based techniques such as the standard gamble and time-tradeoff or through the use of multi-attribute utility measures.19 In the standard gamble, the subject is given a choice between remaining in an impaired state of health for sure or taking a lottery with a probability p of achieving perfect health and probability 1-p of death. The probability at which the subject is indifferent between the lottery and the sure thing provides an estimate of the value attached to the sure-thing health state. Similarly, in the time-tradeoff, the subject places value on a health state by determining the number of years in that state she/he would be willing to give up to enjoy a shorter period in perfect health.20 In the multi-attribute approach, the subject completes a questionnaire based on the measure; examples of multi-attribute measures include the EQ-5D21 and Health Utilities Index (HUI).9 The health status of the subject obtained by completing the questionnaire is then valued using a scoring function for that measure based on community preferences. Given their ability to provide estimates of quality-adjusted survival, preference-based measures have a special role in evaluating interventions in older populations. Further detail on preference-based measures can be found in Torrance 1986.19,22,23
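The arithmetic behind these elicitation techniques can be sketched as follows. This is an illustrative simplification, not a scoring algorithm from any of the cited measures: the function names are hypothetical, the time-tradeoff is reduced to a single indifference point, and quality-adjusted survival is computed without discounting.

```python
def standard_gamble_utility(p_indifference: float) -> float:
    """Utility of the sure-thing state equals the indifference probability p
    of perfect health (with 1 - p of death), on the dead = 0.00 and
    perfect health = 1.00 scale."""
    if not 0.0 <= p_indifference <= 1.0:
        raise ValueError("p_indifference must lie between 0 and 1")
    return p_indifference


def time_tradeoff_utility(years_in_state: float, years_given_up: float) -> float:
    """Utility as the ratio of the shorter period in perfect health that the
    subject regards as equivalent to the full period in the impaired state."""
    if years_in_state <= 0 or not 0 <= years_given_up <= years_in_state:
        raise ValueError("need 0 <= years_given_up <= years_in_state, years_in_state > 0")
    return (years_in_state - years_given_up) / years_in_state


def quality_adjusted_life_years(segments: list[tuple[float, float]]) -> float:
    """Sum of utility * duration over (utility, years) segments; undiscounted."""
    return sum(utility * years for utility, years in segments)
```

For example, a subject willing to give up 3 of 10 years in an impaired state values that state at 0.70, and 2 years at utility 0.70 followed by 3 years at 0.90 yields 4.1 quality-adjusted life years.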


There are three key categories of measurement properties: reliability, validity, and responsiveness (see Table 1 at the end of the paper for brief definitions).

Table 1. Brief definitions of important measurement properties.

Reliability. A reliable measure is consistent and reproducible. Internal consistency is the extent to which items intended to assess health or functional status in a particular domain are correlated with each other and not correlated with items intended to measure other domains. Internal consistency is often measured with Cronbach's alpha. Scores > 0.70 are usually considered to have acceptable internal consistency for group comparisons.24
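As a rough illustration of how Cronbach's alpha is computed for a single domain, the sketch below uses the standard formula: alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the data layout (one row of item scores per respondent) are assumptions made for the example.

```python
from statistics import variance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach's alpha for one domain.

    `scores` holds one row per respondent and one column per item.
    Sample variances (n - 1 denominator) are used throughout.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    items = list(zip(*scores))                    # one tuple of scores per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)
```

Items that move together perfectly give an alpha of 1.0; values would then be compared against the 0.70 threshold noted above.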

Intra- and Inter-Observer Reliability. This form of reliability examines the agreement between two raters—for instance, self-assessment at two points in time (intra-rater) or self- and proxy-assessment (inter-rater). The intra-class correlation coefficient (ICC [continuous response scale]) or kappa statistic (categorical responses) is used to assess the extent of agreement; kappas and ICCs > 0.70 are generally regarded as acceptable.24
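For categorical responses, agreement beyond chance can be summarized with Cohen's kappa for two raters. The sketch below is a minimal unweighted version; the function name is hypothetical, and the studies cited in this paper may have used weighted or multi-rater variants.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Unweighted Cohen's kappa for two raters' categorical ratings.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of subjects")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal category frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)
```

Here `rater_a` and `rater_b` could be, for instance, self- and proxy-assessments of the same subjects.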

Test-Retest Reliability. Test-retest reliability examines the agreement among scores in stable persons at two points in time. The interval between testing is generally one to two weeks—long enough that the person is unlikely to recall their previous response and short enough that it is unlikely the condition of the person has changed. Again, ICCs > 0.70 are regarded as acceptable for group comparisons. A good measure provides stable scores for stable persons.
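There are several forms of the ICC, and the text does not say which form each cited study used. As one illustration, the sketch below computes the one-way random-effects, single-measurement ICC from between-person and within-person mean squares; the choice of this form and the function name are assumptions.

```python
def icc_oneway(scores: list[list[float]]) -> float:
    """One-way random-effects ICC for single measurements.

    `scores` holds one row per person and one column per assessment
    (e.g., test and retest). ICC = (MSB - MSW) / (MSB + (k - 1) * MSW).
    """
    n, k = len(scores), len(scores[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two persons and two assessments")
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    # Between-person and within-person mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in person_means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(scores, person_means) for x in row) / (n * (k - 1))
    if msb + (k - 1) * msw == 0:
        raise ValueError("scores do not vary; ICC is undefined")
    return (msb - msw) / (msb + (k - 1) * msw)
```

Identical test and retest scores for every person give an ICC of 1.0; within-person disagreement lowers the value.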

Content Validity. Content validity is the “extent to which the items are sensible and reflect the intended domain of interest.”13 Does the content of the measure make sense? Are the items included relevant to the domain of interest? Do the items cover the full range relevant to that domain? Are the items comprehensible to respondents? There is no formal statistical test to evaluate content validity. In practice, content validity is evaluated using a structured set of criteria, including those listed above.25-27 Face validity, “the degree to which the items indeed look as though they are an adequate reflection of the construct to be measured,” is a sub-category of content validity.26

Criterion Validity. Criterion validity is the extent to which a measure agrees with a gold standard measure (the criterion). Predictive validity relies on criterion validity. For instance, in the question, “Does baseline self-rated health predict admission to a nursing home or mortality?”; mortality or nursing home admission is regarded as the criterion. In applications other than the assessment of predictive validity, the field of HRQL lacks gold standards and thus relies on the evaluation of construct validity.

Construct Validity. Construct validity is a measure's ability to perform as expected. It involves specifying a priori hypotheses about how the measure should perform based on an underlying model or conceptual framework, testing those hypotheses, and accumulating evidence over time and across settings. Cross-sectional construct validity involves making comparisons at a point in time. In convergent validity we expect a high correlation between two different measures of the same concept or measures of highly related domains such as mobility and self-care, or anxiety and depression. In discriminant validity we expect little or no correlation between measures of domains that are unrelated, such as vision and pain. Another strategy for assessing construct validity is known-groups comparisons. For example, we would expect the scores for a measure of mobility to be systematically related to known groups based on the categories in the New York Heart Association functional classification system.28
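Convergent and discriminant validity are typically examined through correlations between scores. A minimal sketch of the Pearson correlation is shown below; the function name is hypothetical, and other coefficients (such as rank correlations) may be more appropriate for ordinal scales.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two sets of scores for the same persons."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable does not vary")
    return sxy / sqrt(sxx * syy)
```

Under the a priori hypotheses described above, a high value would be expected between, say, mobility and self-care scores (convergent validity), and a value near zero between vision and pain scores (discriminant validity).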

Responsiveness (Longitudinal Construct Validity). Longitudinal construct validity measures within-person change over time. Does the measure capture meaningful change when it occurs? Change scores for those known to have changed (by some other criterion) should exceed change scores for those known not to have changed. For those who have changed, change scores should be systematically related to the degree of change. Measures for which there is substantial evidence of responsiveness in the relevant area enhance the confidence of the reviewer in the validity of the estimates of change.

Responsiveness is often assessed using effect size (ES, the magnitude of the change divided by the standard deviation of baseline scores), the standardized response mean (SRM, the magnitude of change divided by the standard deviation of change scores) or other related measures that are ratios of signal to noise.29 Cohen provides a scheme to interpret the magnitude of ES: small (0.20); moderate (0.50); or large (≥ 0.80) change.30 A related measure, the standard error of measurement (SEM), is also frequently used. SEM is computed as the standard deviation at baseline times the square root of one minus test-retest reliability.31
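The three statistics defined above can be computed directly from paired baseline and followup scores. The sketch below follows the definitions in the text; the function names are hypothetical, and sample standard deviations (n − 1 denominator) are assumed.

```python
from math import sqrt
from statistics import mean, stdev


def effect_size(baseline: list[float], followup: list[float]) -> float:
    """ES: mean change divided by the standard deviation of baseline scores."""
    changes = [f - b for b, f in zip(baseline, followup)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline: list[float], followup: list[float]) -> float:
    """SRM: mean change divided by the standard deviation of change scores."""
    changes = [f - b for b, f in zip(baseline, followup)]
    return mean(changes) / stdev(changes)


def standard_error_of_measurement(baseline_sd: float, test_retest_reliability: float) -> float:
    """SEM: baseline standard deviation times sqrt(1 - test-retest reliability)."""
    if not 0.0 <= test_retest_reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return baseline_sd * sqrt(1 - test_retest_reliability)
```

With baseline scores of 10, 20, and 30 and followup scores of 15, 25, and 41, the mean change is 7 and the baseline standard deviation is 10, giving an ES of 0.70, between moderate and large under Cohen's scheme.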

The Distinction Between Predictive Validity and Responsiveness

Predictive validity refers to the ability of a baseline score to predict subsequent events. For instance, in both population health surveys and clinical studies, self-rated health (SRH; excellent, very good, good, fair, or poor) has been shown to predict mortality, admission to nursing homes, and other major health outcomes.32-41 However, as there are only five response options, the responsiveness of SRH is limited. Predictive validity does not necessarily imply that a measure will be able to detect within-person change over time. Further, predictive validity is based on the association between a baseline value and a subsequent outcome. In contrast, responsiveness focuses on the degree of change between the baseline and followup assessments.

How Should Reviewers Approach These Measurement Properties?

In assessing the measurement properties of functional status and HRQL measures there are a number of key questions.42 How extensive is the evidence on the relevant measurement properties, especially responsiveness and interpretability, of the measures? How rigorous is that evidence? Is the evidence directly applicable to the issues at hand? Evidence on cross-sectional and longitudinal construct validity and interpretation is central to evaluating the effects of interventions. Construct validity involves the accumulation of evidence. The interpretation of that evidence also involves subjective judgments. If a systematic reviewer is confident that the measure is valid and responsive in the setting being reviewed, the reviewer can be more confident in the evidence on the effectiveness of an intervention. If the evidence on validity and responsiveness of the measure in that context is equivocal, interpreting results based on that measure will be challenging. The focus in this paper is on using evidence for making group-level comparisons rather than using evidence for making individual-patient-level decisions. The same methodological issues are relevant both for measures of functional status and HRQL.

Reviewing Measures of Health-Related Quality of Life: Special Considerations for Older Adult Populations

Many measures of functional status and HRQL were not designed specifically for use in older adult populations. A useful review both of generic measures that have been applied to older populations and older-population specific measures is provided in Haywood and colleagues.43 Further, evidence on the construct validity and responsiveness of many measures is based on studies in populations whose mean age was 64-86, but age ranges vary by measure.43,44 Extensive evidence on the reliability and validity of a measure does not necessarily imply that there is abundant evidence supporting its use among older adults, especially those at the upper extremes of the age ranges. Potential ceiling and floor effects, discussed below, are also very important in the context of studies of older adults.

To illustrate this we briefly review measurement properties for several widely used generic measures of HRQL: the Short-Form 36 (SF-36) and its preference-based version, the Short-Form 6D (SF-6D or Six Dimensions), EuroQol-5D (EQ-5D), the Health Utilities Index Mark 3 (HUI3), and the Quality of Well-Being Scale (QWB). The SF-36 includes eight domains: physical functioning (PF), role-physical, bodily pain, general health, vitality, social functioning, role-emotional, and mental health.8,45 The EQ-5D includes a five-attribute health-status classification system: mobility, self-care, usual activity, pain/discomfort, and anxiety/depression, with three levels per attribute: no problem, some problem, or extreme problem.21 The HUI3 system includes eight attributes: vision, hearing, speech, ambulation, dexterity, emotion, cognition, and pain and discomfort, with five or six levels per attribute, from severely impaired (“so unhappy that life is not worthwhile”) to no problem or normal (“happy and interested in life”).46 The original version of the Quality of Well-Being Scale (QWB) included three attributes (mobility, physical activity, and social activity) and a problem/symptom complex.47 The more recent QWB-SA (self-administered) retains the same structure but includes fewer levels within each attribute and fewer problems/symptoms.48

How well do these measures work in older adults? In a prospective cohort study of patients 75+, Brazier and colleagues examined test-retest reliability in patients who self-identified as stable: patients who indicated that their health had not changed. Correlations for domains of the SF-36 ranged from 0.28 to 0.70; the correlation for EQ-5D scores was 0.67.49 In a paper based on one of the original Medical Outcome Study (MOS) surveys (n = 3,445), one of the major studies upon which the SF-36 is based, McHorney and colleagues reported lower completion rates by item for those ≥ 75 than for the 65-74 group, who in turn had lower completion rates than persons <65. However, estimates of internal consistency reliability (Cronbach's alpha) did not vary by age, education, poverty status, diagnosis, or disease severity.50

In a study of patients 65+ who identified themselves as stable, Andresen and colleagues reported intraclass correlation coefficients (ICCs) for SF-36 domains ranging from 0.65 to 0.87. They also showed evidence of cross-sectional construct validity for the SF-36 in that domain scores were lower for those who were older and for those with more severe comorbidities.51

Naglie and colleagues reported test-retest reliability estimates for patients with mild (mini-mental state examination [MMSE] scores 19-26) or moderate (MMSE 10-18) cognitive impairment, and for proxy family caregivers, for three generic preference-based measures: EQ-5D, HUI3, and the QWB.52 Follow-up assessments were done approximately 2 weeks after the initial assessment. Examining consistency between initial and re-test responses by patients, the ICCs for the entire cohort were 0.79 (EQ-5D), 0.47 (HUI3), and 0.70 (QWB); for those with mild cognitive impairment the ICCs were 0.70, 0.75, and 0.81, respectively; for those with moderate impairment they were 0.83, 0.25, and 0.59, respectively. Examining consistency between initial and re-test proxy responses, the ICCs were 0.71, 0.81, and 0.70, respectively. The results for HUI3 and the QWB were sensible: test-retest reliability was reasonable for those with mild cognitive impairment but much lower for those with moderate cognitive impairment, who were not reliable respondents.52 This result has implications for the use of proxy respondents for subjects with moderate and severe cognitive impairment, a topic which is discussed below.

Two generalizations emerge from the studies reporting results for SF-36 and the Naglie and colleagues paper. First, the severely cognitively impaired are, in general, not capable of providing reliable and valid responses. Second, if the highly cognitively impaired are excluded, reliability in samples of older adults appears to be of the same order of magnitude as in general adult samples.

Floor and Ceiling Effects

If the range of function covered by a measure is less than the range experienced by patients, especially frail older adults, the measure may lack responsiveness. The potential for floor and ceiling effects is often assessed by examining response patterns. If there are spikes at the highest or lowest response option, this is often interpreted as evidence of ceiling or floor effects, respectively. However, when using measures to assess the effectiveness of interventions, prospective evidence of the performance of a measure is more important than whether or not there are spikes. Results from longitudinal studies indicate that the SF-36 (and therefore SF-6D) has well-known floor effects that have been recognized in a wide variety of clinical settings and samples.53-67 In a prospective cohort study comparing utility scores before and after elective total hip arthroplasty, a gain of 0.10 was registered by SF-6D and a gain of 0.23 by HUI3.55 (It should be noted that Version 2 of SF-36 is less prone to floor effects than Version 1. However, floor effects have been observed in studies using both versions.) In a natural history cohort of 124 patients recruited shortly after a stroke and followed for 6 months, the gain in overall HRQL observed in the 98 survivors (18 lost to followup) was 0.24 according to the EQ-5D21 and 0.25 according to the HUI3,46 but only 0.13 according to SF-6D.68 Floor effects attenuated the ability of SF-36 and SF-6D to capture gains when many patients had moderate or severe burdens at baseline. The magnitude of improvement experienced by patients was underestimated because some patients were “worse off” than the measure could capture before the intervention; this underestimation could seriously bias estimates of the magnitude of change associated with interventions and cost-effectiveness estimates of those interventions.
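The response-pattern check described above (looking for spikes at the lowest or highest possible score) can be sketched as follows. The function name is hypothetical, and no threshold for what counts as a spike is assumed. As the text notes, this cross-sectional check is less informative than prospective evidence on how the measure performs.

```python
def floor_ceiling_proportions(
    scores: list[float], scale_min: float, scale_max: float
) -> tuple[float, float]:
    """Proportion of respondents at the lowest and highest possible score.

    Returns (floor_proportion, ceiling_proportion).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_proportion = sum(s == scale_min for s in scores) / n
    ceiling_proportion = sum(s == scale_max for s in scores) / n
    return floor_proportion, ceiling_proportion
```

A large share of baseline respondents at the scale minimum would suggest that improvement in those respondents may be understated by the measure.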

Similarly, ceiling effects can threaten responsiveness. The absence of levels for mild problems in the EQ-5D probably accounts for the ceiling effects associated with the measure in population health survey and clinical applications. In a review of generic preference-based measures used in studies of patients with rheumatoid arthritis, ceiling effects associated with EQ-5D attenuated its responsiveness.52,69-71 Similarly, a lack of responsiveness of EQ-5D has been reported in clinical studies of urinary incontinence in females72 and treatments for leg ulcers.73 The recently developed five-level EQ-5D may reduce ceiling effects.74 Ceiling effects in population health surveys have also been observed for HUI2 and HUI3.75,76

Proxy Respondents

Cognitive impairment or physical disability may attenuate older adults' ability to respond, and this situation may be temporary or chronic. One approach to this problem is to rely on a proxy respondent—a family member or caregiver who is familiar with the subject's current status. Agreement between self and proxy report then becomes an important issue; the “proxy as an agent” (if “X” could respond, what would she/he say) must be distinguished from the “proxy as an informed observer” (which of the following best describes the current condition of “X”). Most investigations of agreement have adopted the informed-observer approach.

Magaziner and colleagues examined agreement between self- and proxy-report in a prospective cohort of patients ≥ 65 (n = 361) being followed after hip fracture.77 Both sets of respondents independently completed questionnaires on activities of daily living, instrumental activities of daily living, mental status, and depressive symptoms. Proxies tended to rate patients as more disabled than the patients rated themselves. Agreement was higher when the proxy and subject lived together; agreement was also higher when the proxy was a sibling or spouse than when the proxy was an offspring or nonrelative. Even mild cognitive impairment in the patient was associated with less agreement. Agreement was often lower on less observable aspects of physical and mental health.

Clearly, in studies that gathered responses from both patients and their proxies, the responses were not interchangeable. The degree of agreement was affected by the observability of that aspect of health status, the degree of familiarity of the proxy with the current condition of the patient, and in some cases, the burden being experienced by the proxy caregiver.78 The extent to which agreement varies with respect to these factors differs across studies. However, in general, these factors are associated with quantitatively important differences in the degree of agreement. Nonetheless, the results indicate a reasonable amount of agreement. Furthermore, evidence suggests that more reliable and valid information is available from proxy respondents who have frequent contact with patients who are becoming incapable of responding than is available from the patient directly. Whether differences in the source of the measure, patient versus proxy, affect results in a systematic review could be evaluated through meta-regression or other techniques. Failure to obtain data from proxy respondents entails a substantial risk of overestimating the health of a cohort because the most severely affected will often not be able to respond.79 Thus, in general, it is wise to collect both self and proxy assessments and to analyze them separately.

