Neuropsychology. Author manuscript; available in PMC 2009 Nov 1.
PMCID: PMC2593909

Implications of short-term retest effects for the interpretation of longitudinal change


Although within-person comparisons allow direct assessments of change, some of the observed change may reflect effects associated with prior test experience rather than the processes of primary interest. One method that might allow retest effects to be distinguished from other influences on change involves comparing the pattern of results in a longitudinal study with those in a study with a very short retest interval. Three short-term retest studies with moderately large samples of adults are used to provide this type of reference information about the magnitude of change, test-retest correlations, reliabilities of change, and correlations of the changes in different cognitive variables with each other and with other types of variables.

Keywords: longitudinal change, practice effects, maturation, aging

It is widely recognized that a major advantage of longitudinal designs over cross-sectional designs is that within-person change can be measured directly instead of being inferred indirectly from comparisons of different people. Changes observed in longitudinal comparisons are usually attributed to influences operating during the interval between successive measurement occasions, with the nature of the influences varying according to the specific substantive focus of the research. For example, in developmental studies most of the influences are assumed to reflect processes related to maturation, in intervention studies the influences are assumed to reflect processes related to the treatment, and in studies of disease progression the influences are assumed to reflect factors associated with the course of the underlying pathology. Issues of interpreting longitudinal change are therefore quite general, but for the sake of simplicity the following discussion will emphasize a developmental perspective in which processes of maturation are the primary change influences of interest.

Inferences about various aspects of change can be derived from different properties of longitudinal data. For example, the mean change from the first to the second measurement occasion is usually interpreted as a reflection of the magnitude of maturation influences operating across the retest interval. Second, the strength of the correlation between scores on successive occasions is sometimes used as an indirect indication of the amount of individual difference variation in change because these stability correlations can be expected to decrease with increases in the magnitude of individual differences in the size, and direction, of longitudinal change. Third, an inference that maturation affects something that is common to multiple variables might be reached when several variables are available from the same individuals, and the changes in different variables are found to be correlated. And finally, correlations of the measures of change with other variables are often used to identify possible moderators of cognitive aging. To illustrate, a finding that a higher level of education was associated with less negative change could lead to an inference that people with the greatest “cognitive reserve” (e.g., Stern, 2003) are more resistant to age-related cognitive decline.

Although the preceding inferences are often valid, longitudinal comparisons involve successive testing of the same individuals, and thus it is possible that at least some of the observed within-person change in performance is attributable to effects of prior test experience rather than to influences related to maturation. Retest effects are frequently ignored as an influence on longitudinal change, particularly in research on aging, because they are often assumed to be very small or short-lasting. However, recent research indicates that retest gains can average .40 standard deviation (SD) units or more (for a recent meta-analysis see Hausknecht, Halpert, DiPaolo & Gerrard, 2007), and can be detected up to 5 years (Burke, 1997) and even 12 years (Salthouse, Schroeder & Ferrer, 2004) after the initial test.

A number of methods have been developed to take retest effects into account when evaluating change. One such method within the field of neuropsychological assessment is the reliable change index (e.g., Chelune, Naugle, Luders, Sedlak & Awad, 1993; Frerichs & Tuokko, 2005; Knight, McMahon, Skeaff & Green, 2007). The primary rationale for our approach, however, is that methods to correct for the influences of retest effects can only be strongly justified, and eventually improved upon, after retest effects are fully characterized and understood. Moreover, in contrast to the reliable change index approach, we emphasize a multivariate perspective in which relations among short-term retest effects in different variables are of interest, and not just the magnitude of retest effects in a single variable.

A key assumption of the research described in this article is that maturation and retest influences might be distinguished with very short-term longitudinal studies, in which the intervals between tests are in the range of days instead of years. The rationale is that little or no influence associated with maturation is likely to be operating over such short intervals, and therefore any changes evident under these conditions can be inferred to primarily reflect retest effects. Results from longitudinal studies with very short retest intervals might therefore provide a valuable baseline for interpreting results from conventional longitudinal studies in which the intervals between tests are one year or longer. Some allowance must be made for the possibility that retest effects are likely to decay over time, but as noted above, the interval until no effects are detectable could be as long as 12 years. In this article we report analyses similar to those described above with data from longitudinal studies involving retest intervals averaging about one week to illustrate how conclusions from traditional longitudinal studies can be misleading if results from studies with very short retest intervals are not considered.

The data were obtained from three studies in which moderately large samples of adults ranging from 18 to over 80 years of age performed the same battery of 16 cognitive tests either two or three times, with intervals between the tests ranging from one day to a few weeks. The participants in Studies 1 and 2 performed different versions of the tests on each of three sessions, with the Study 1 participants tested in 2004, 2005, and 2006, and the Study 2 participants tested in 2007. Although Studies 1 and 2 were identical, they are reported separately to allow a comparison of change on tests with the same and different items without a confound of year of testing. In contrast, the participants in Study 3, who like those in Study 2 were also tested in 2007, performed exactly the same tests with identical items on the first and second sessions.

Change in two-occasion longitudinal comparisons is typically assessed in one of two ways. The simplest method is with a difference score computed by subtracting the score at the initial occasion (T1) from the score at a later occasion (T2). A second method involves computing a residual score by statistically partialling the influence of the score on the first assessment from the score on the second assessment. The two methods are related as both can be conceptualized in terms of a contrast of the T2 score with T′, where T′ is equal to a + b(T1). However, in the difference score method the values of a and b are fixed at 0 and 1, respectively, whereas in the residual score method these two parameters are estimated from the data with a least-squares regression equation (cf. Cohen, Cohen, West & Aiken, 2003, p. 570). Both measures of change are examined in the current report to illustrate potential differences in the patterns of results with the two methods of examining change.
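The two methods can be illustrated with a short sketch. The data below are simulated for illustration only; the variable names and values are not from the studies reported here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical z-scored performance for 100 participants on two occasions.
t1 = rng.standard_normal(100)
t2 = 0.8 * t1 + 0.3 + 0.5 * rng.standard_normal(100)

# Difference score: T2 - T', with a and b fixed at 0 and 1, so T' = T1.
difference = t2 - t1

# Residual score: a and b estimated by least-squares regression of T2 on T1.
b, a = np.polyfit(t1, t2, 1)          # slope, intercept
residual = t2 - (a + b * t1)

# By construction the residual is uncorrelated with the initial score,
# whereas the difference generally is not.
r_resid = np.corrcoef(t1, residual)[0, 1]
r_diff = np.corrcoef(t1, difference)[0, 1]
```

Because the residual is orthogonal to T1, it is the more natural measure when the question concerns change that is independent of initial standing, a property that becomes relevant in the analyses reported below.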

Although estimates of the reliability of measures of change are seldom reported, this information is important for the interpretation of correlations because correlations of changes with other variables are limited by the reliabilities of the measures of change. The data in the current project were recorded at the level of individual items for every participant in each test, and thus separate scores could be computed for the odd-numbered items and for the even-numbered items on each session. This allowed differences and residuals to be computed for the odd and even items, which were then treated as units of analysis in estimating coefficient alpha reliability of the measures of change.

Two individual difference variables, age and general cognitive ability, were also examined with respect to their relations with the measures of short-term change. An estimate of general cognitive ability was created from the first principal component (1st PC) obtained in a principal components analysis based on all of the variables from the first session. An advantage of this method of assessing general cognitive ability is that the 1st PC represents the largest mathematical overlap of the variance among all variables, and involves minimal assumptions about what specific variables represent.
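A sketch of this procedure follows, with simulated data standing in for the actual battery (the sample size, loadings, and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated session 1 data: 200 participants x 16 variables that share a
# common factor, standing in for the actual test battery.
g = rng.standard_normal((200, 1))
scores = 0.6 * g + 0.8 * rng.standard_normal((200, 16))

# Standardize, then take the first principal component of the
# correlation matrix (np.linalg.eigh sorts eigenvalues ascending).
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
first_pc = z @ eigvecs[:, -1]

# Proportion of total variance associated with the 1st PC.
variance_explained = eigvals[-1] / eigvals.sum()
```

In a simulation like this the 1st PC recovers the common factor closely, which illustrates why it can serve as an estimate of general ability with minimal assumptions about what the individual variables represent.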



Characteristics of the participants in the three studies are reported in Table 1. All participants were recruited through newspaper advertisements, flyers, and referrals from other participants, and were paid for their participation. The data in Study 1 were aggregated across several studies originally designed for another purpose, and some of the data were previously analyzed for a study of within-person variability (Salthouse, 2007). Studies 2 and 3 are new and no prior analyses of those data have been published. None of the participants had scores below 24 on the Mini-Mental State Examination (Folstein, Folstein & McHugh, 1975), which is often used to screen for dementia. Inspection of the entries in Table 1 reveals that the average amount of education was greater than 15 years, and that the average rating of health was in the “very good” to “excellent” range. One method that can be used to evaluate the representativeness of a sample involves examining scores on standardized tests relative to the test norms. It is apparent in Table 1 that the means of the age-adjusted scaled scores for four standardized tests were about one-half to one standard deviation above the averages of the nationally representative samples used to establish the norms for those tests. The participants in the current studies can therefore be inferred to have somewhat higher average levels of cognitive abilities than people in the general population, perhaps because they were self-selected volunteers. However, it is important to note that this is true to nearly the same extent at each age, and therefore there is no evidence that certain age groups had higher ability levels than others with respect to the population norms.
It should also be noted that the standard deviations of the scaled scores were close to 3, the value in the normative samples, which indicates that these samples exhibited nearly the same amount of variability as the normative samples that were selected to be representative of the U.S. population.

Table 1
Descriptive characteristics of the samples


The cognitive tests are listed in the appendix, and have been described in detail in several other publications (e.g., Salthouse, 2004; 2005; 2007; Salthouse, Atkinson & Berish, 2003; Salthouse, Berish & Siedlecki, 2004; Salthouse & Ferrer-Caja, 2003; Salthouse, Siedlecki & Krueger, 2007). The 16 tests were selected to represent five major cognitive abilities (i.e., reasoning, spatial visualization, episodic memory, perceptual speed, and vocabulary) that have been well established in the cognitive psychometric literature (e.g., Carroll, 1993; Salthouse, 2004). Although not all of these tests are frequently used in neuropsychology, earlier research has established that they have moderate to large correlations with common neuropsychological tests such as the Wisconsin Card Sorting Test, Tower of Hanoi, Stroop, Trail Making, and various fluency tests (e.g., Salthouse, 2005).

Different versions of the tests were performed on each of the three sessions in Studies 1 and 2. The scores on the versions administered on the second and third sessions were equated to the first session means with regression equations based on data from 90 individuals who received the three versions in a counterbalanced order (cf., Salthouse, 2007). Identical versions of the tests were presented in the first and second sessions in Study 3, with the third session containing different types of tests for a separate project. Because the sessions were scheduled according to the participants’ availability, the intervals between sessions ranged from 1 day to over 30 days. Means and standard deviations of the retest intervals are presented in Table 1 where it can be seen that the average interval between test sessions was less than 7 days in each study.

Analysis Plan

Six sets of analyses were conducted to address the different aspects of change discussed in the introduction. The initial analyses were conducted to explore properties of the data sets and involved examining the effect of the length of the retest interval on the magnitude of change, and the structural relations among variables across sessions and across studies. The next analyses investigated the magnitude of the retest changes, and the magnitude of the correlations between the scores on successive sessions. The remaining analyses focused on change scores, with the first set examining reliability, and the second set examining intercorrelations among the changes in different cognitive variables. The final analyses examined correlations of age and general cognitive ability with the short-term changes.


An initial set of analyses examined relations between the length of the interval between the first and second sessions and the magnitude of the changes in test performance. The analyses consisted of correlations between the length of the interval and the session 2 residual score for each variable. None of the correlations were very large, there were nearly as many positive as negative correlations, and the median correlation was −.01. It therefore does not appear that there was much, if any, effect of the length of the interval between sessions on the retest gains in these studies, and thus the retest interval variable was ignored in subsequent analyses. However, it should be noted that the range of retest intervals was highly restricted, with the intervals for most of the participants ranging between 1 and 10 days, and influences of the length of the retest interval might be apparent with greater variation in the intervals.

A second set of analyses consisted of confirmatory factor analyses on the 16 variables from each session in each study. The results of these analyses closely resembled those from other samples (see Salthouse, Pink & Tucker-Drob, in press). Of particular importance in the current context is that the patterns were also very similar across the sessions within each study as the congruence coefficients (cf., Jensen, 1998) were all greater than .95. The finding of nearly constant relations among the variables suggests that the variables have the same meaning at each session, and in each study.

Average change

As noted above, each session in Study 1 involved different versions of the cognitive tests. Mean levels of performance for the cognitive variables on sessions 2 and 3 in this study, expressed in standard deviation (SD) units from the scores on the first session, are portrayed in Figure 1. Because zero in this figure represents the average performance on the first session, the heights of the bars represent the size of the retest gains from the first session, scaled relative to the magnitude of individual differences on the task. The magnitude of a given bar therefore corresponds to an estimated effect size for the retest gain, with the standard error bar indicating the precision of the estimate. Because a value that differs from zero by 2.33 standard errors is significant at the .01 level, means that are more than 2.33 standard errors from zero are statistically significant. Inspection of the figure reveals that for most variables the largest gains were from the first to the second assessment, with much smaller gains from the second to the third assessment.

Figure 1
Scores on sessions 2 and 3 in standard deviation units from the scores on session 1, Study 1. Bars above each column are standard errors.

There was some variation in the pattern of retest gains across cognitive abilities as the mean gains were small for reasoning variables, modest for memory variables, and large for the spatial visualization and speed variables. However, there was also variation in the magnitude of the retest effects within the same cognitive ability domain. For example, the average gain from the first to the second assessment was fairly small for the Form Boards variable, but relatively large for the Paper Folding and Spatial Relations variables.

Figure 2 uses the same format as Figure 1 to portray scores for Study 2 (with different test versions on the second test session), and for Study 3 (with identical test versions on the second test session). Note that the vertical axis for the episodic memory variables uses a different scale from the other variables to accommodate the large gains evident in some of these variables when successive tests contain identical items. Comparison of the black bars across Figures 1 and 2 reveals that the patterns of retest changes with different test versions on the first and second sessions were very similar in Studies 1 and 2. This finding is not surprising because the studies were exact replications of one another, differing only with respect to the years of testing. Examination of the black and gray bars in Figure 2 reveals that the pattern of changes for identical and different test versions varied across cognitive tests. To illustrate, the gains for identical versions (gray bars) were generally larger than the gains for different versions (black bars) in the reasoning and memory tests, but they were nearly the same magnitude for most of the speed and spatial visualization tests. Independent groups t-tests revealed that the gains for identical versions were significantly (p<.01) greater than the gains for different versions for the Shipley, Form Boards, Word Recall, and Logical Memory tests, but surprisingly were significantly greater for the different version than for the identical version of the Spatial Relations test.

Figure 2
Scores on session 2 in standard deviation units from the scores on session 1. Study 2 involved different items on the two sessions, and Study 3 involved the same items on both sessions. Bars above each column are standard errors.

Scores for the vocabulary variables are portrayed in Figure 3, with values for sessions 2 and 3 in Study 1 at the top, and values for session 2 in Studies 2 (different versions) and 3 (same versions) at the bottom. It is apparent that the means of the vocabulary variables in sessions 2 and 3 were relatively small when scaled in session 1 SD units, indicating very little performance gain with retesting. Furthermore, the changes in the vocabulary tests were generally similar across tests with same and different items, with the exception of a significantly larger retest gain when identical items were repeated in the Picture Vocabulary test.

Figure 3
Vocabulary scores on sessions 2 and 3 in Study 1(top) and on session 2 in Studies 2 and 3 (bottom) in standard deviation units from the scores on session 1. Bars above each column are standard errors.

Test-retest correlations

Table 2 contains the correlations of the scores across the first two sessions for tests with different items in Studies 1 and 2, and for tests with identical items in Study 3. Medians of the correlations were .69 for Study 1, .75 for Study 2, and .82 for Study 3. Comparison of the correlations in Studies 2 and 3 with t-tests on Fisher r-to-z transformed correlations revealed that the correlations with identical versions (Study 3) were significantly (p<.01) greater than those with different versions (Study 2) for the Shipley, Letter Sets, Letter Comparison, and Spatial Relations tests, and for all four vocabulary tests. Because identical test versions were used on both sessions in Study 3, those values can be interpreted as test-retest reliability coefficients. The moderately high stability coefficients imply that individual differences in change were small relative to the individual differences in the initial scores. Direct computations of the variances in the difference and residual measures of change confirmed this implication. That is, the median variances across the three studies were .45 for the difference scores and .28 for the residuals. Because both differences and residuals were assessed in z-score units scaled relative to the distribution of initial scores, and because z-scores have variances of 1.0, these values indicate that individual differences in the change scores were only about one-fourth to one-half the magnitude of the individual differences in the original scores.
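The comparison of independent correlations after a Fisher r-to-z transformation can be sketched as follows. This is the standard large-sample z test for two independent correlations; the sample sizes in the example are hypothetical, not the actual study Ns.

```python
import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Two-tailed test of the difference between two independent
    correlations using Fisher's r-to-z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)        # Fisher transforms
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))           # two-tailed p value
    return z, p

# Hypothetical example: r = .82 (identical versions) vs. r = .75
# (different versions), each based on 150 participants.
z, p = compare_independent_correlations(0.82, 150, 0.75, 150)
```

The transformation stabilizes the sampling variance of r, which is why the standard error depends only on the sample sizes.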

Table 2
Correlations between scores on the first and second measurement occasions

Measurement error can be minimized by forming latent constructs at each occasion, and then examining the across-session correlations at the level of latent constructs. These latent construct analyses were carried out using the AMOS (Arbuckle, 2007) structural equation modeling program, with separate analyses for each construct, and correlations allowed between the residuals for each variable to account for variable-specific relations across sessions. The latent construct correlations are presented in the bottom of Table 2, where it can be seen that they are all close to 1.0. When examined at the level of latent constructs, therefore, individual differences in short-term change can be inferred to be either extremely small, or possibly even non-existent.

Reliability of short-term change

Because accuracy was recorded for every item in each test, scores could be computed for the odd-numbered and even-numbered items on each session, as well as for the differences and residuals across sessions. These “odd” and “even” scores were then used to compute coefficient alpha reliability for the session 1 scores, and for the differences and residuals across sessions 1 and 2. The estimated reliabilities computed in this manner are summarized in Table 3. The values in the first three columns are reliability estimates for the session 1 scores, values in the next three columns are estimates of the reliability for the differences, and those in the last three columns are estimates of the reliability for the session 2 residual scores. Across the three studies, the median estimated reliability for the session 1 scores was .85, and the corresponding medians for the reliabilities of the differences and residuals were .32 and .42, respectively. It is clear from these results that reliability is much lower for the measures of change than for the scores on the initial session.
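A minimal sketch of this computation is given below, with simulated odd- and even-item difference scores; the actual scoring details may differ from this illustration.

```python
import numpy as np

def coefficient_alpha(parts):
    """Coefficient alpha from k part scores per participant
    (e.g., odd-item and even-item change scores)."""
    parts = np.asarray(parts, dtype=float)
    k = parts.shape[1]
    item_variances = parts.var(axis=0, ddof=1).sum()
    total_variance = parts.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances / total_variance)

# Simulated odd/even difference scores for 50 participants: a common
# change component plus half-specific noise (arbitrary values).
rng = np.random.default_rng(2)
true_change = rng.standard_normal(50)
odd_diff = true_change + rng.standard_normal(50)
even_diff = true_change + rng.standard_normal(50)
alpha = coefficient_alpha(np.column_stack([odd_diff, even_diff]))
```

With only two parts, alpha reduces to a split-half estimate: it approaches 1 as the odd and even change scores converge, and approaches 0 as they become unrelated.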

Table 3
Estimates of reliability for scores on the first session and for differences and residuals

Correlations among the short-term changes

One of the simplest methods of examining the pattern of interrelations among variables is with exploratory factor analyses. Because the structural relations among the variables are not necessarily similar for session 1 scores and for differences or residuals, separate exploratory factor analyses were conducted on each type of variable. Across the three studies the first factor was associated with between 40.3% and 42.7% of the variance for the session 1 scores, but with only from 10.5% to 17.3% of the variance for the differences or for the residuals. Five factors accounted for between 75.9% and 78.1% of the variance in the session 1 scores, but for only between 40.3% and 54.5% for the differences and for the residuals. Furthermore, the pattern of weaker relations among the measures of change was still evident when the analyses were repeated after adjusting each correlation for unreliability.

Correlations of the changes were also examined among the variables within each domain of cognitive ability. Because residuals are independent of the initial scores, they are the most meaningful measures of change for these analyses. The pattern was very similar in each study, and thus only medians across studies are reported. These medians were .15 for Reasoning, .11 for Space, .18 for Memory, .10 for Speed, and .06 for Vocabulary. For purposes of comparison, the corresponding correlations among the session 1 scores were .64 for Reasoning, .62 for Space, .51 for Memory, .66 for Speed, and .69 for Vocabulary. As was the case with the reliabilities, therefore, the values for the change measures were markedly weaker than those for the scores on the initial session.

Predictors of Individual differences in short-term changes

As noted above, the estimate of general cognitive ability was the first principal component (1st PC) from a principal components analysis of all the variables from the first session. This component was associated with between 40.3% and 42.7% of the variance in the three studies, and its correlations with age in Studies 1, 2, and 3 were, respectively, −.55, −.40, and −.47.

Table 4 contains simple correlations of age and of the 1st PC for the difference scores and residuals in each study. Because of the negative relation between age and general cognitive ability, some of the relations of the change measures with age are probably mediated through effects on cognitive ability. Indeed, the unique age-related effects, obtained from analyses in which age and the 1st PC were simultaneous predictors of the target variable, were consistently smaller than those reported in Table 4.

Table 4
Correlations of age and an estimate of general cognitive ability (1st PC) with difference and residual estimates of change

Many of the difference scores were positively related to age, and negatively related to general cognitive ability. However, a reverse pattern was evident for the residuals as many of them were negatively related to age, but positively related to general cognitive ability. This reversal is likely attributable to the negative relations between the differences and the original scores, as the median correlations between the T1 score and the T2 − T1 difference were −.66 in Study 1, −.68 in Study 2, and −.17 in Study 3.


It is apparent in Figures 1 and 2 that there was considerable variation across cognitive variables in the magnitude of the average change from the first to the second session, from the second to the third session, and according to whether successive tests contained identical items or different items. The spatial tests tended to have large average gains, possibly because the items in these tests are unfamiliar to most people. Relatively large gains were also evident on the perceptual speed tests, perhaps because they involve a somewhat novel mode of behavior. The gains were small on reasoning and memory tests when the test versions involved different items, but the increase in performance for some memory tests was as much as .75 SD units when the tests on both sessions consisted of identical items.

These results are consistent with earlier reports in several respects. For example, significant short-term retest gains have been reported in a variety of different cognitive tests (e.g., Basso, Bornstein & Lang, 1999; Benedict, 2005; Benedict & Zgaljardic, 1998; Dikmen, Heaton, Grant, & Temkin, 1999; Duff, Beglinger, Schoenberg, Patton, Mold, Scott & Adams, 2005; Knight, McMahon, Skeaff & Green, 2007; Lemay, Bedard, Rouleau & Tremblay, 2004; Levine, Miller, Becker, Selnes & Cohen, 2004; Lowe & Rabbitt, 1998; Reeve & Lam, 2005; Salinsky, Storzbach, Dodrill & Binder, 2001; Theisen, Rapport, Axelrod & Brines, 1998; Wilson, Watson, Baddeley, Emslie & Evans, 2000; Woods, Delis, Scott, Kramer & Holdnack, 2006). Furthermore, several studies with three or more test sessions have found that the greatest gain occurs from the first to the second assessment (e.g., Beglinger, Gaydos, Tangphao-Daniels, Duff, Kareken, Crawford, Fastenau & Siemers, 2005; Benedict & Zgaljardic, 1998; Collie, Maruff, Darby, & McStephen 2003; DeMonte, Geffen & Kwapil 2005; Falleti, Maruff, Collie & Darby, 2006; Hausknecht, Trevor & Farr, 2002; Hausknecht, et al., 2007; Ivnik, Smith, Lucas, Petersen, Boeve, Kokmen & Tangalos, 1999; Lemay, et al., 2004; Rapport, Brines, Axelrod & Theisen, 1997; Reeve & Lam, 2005; Theisen, et al., 1998).

The median short-term change for non-vocabulary variables in successive tests with identical items was .30 SD units. The median cross-sectional age slope for these 12 variables was −.024 SD per year, and thus the effects of a single prior test are larger than what would be expected across more than 10 years of cross-sectional aging. If these effects were ignored, inferences about the magnitude, and even the direction, of maturational change could be very misleading. This basic point has been recognized for many years, but it has not always been appreciated that the retest influence varies considerably across different cognitive variables. For example, the short-term changes with identical test versions were much larger for certain memory tests than for some tests of reasoning and perceptual speed. In a conventional longitudinal study results such as these might be interpreted as evidence that cognitive variables differ in their rates of aging, but because the interval between sessions in the current project averaged less than one week, all of the differences are attributable to variations in the magnitude of short-term retest effects.
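The comparison in the preceding paragraph amounts to a simple calculation with the medians reported above:

```python
# Median short-term gain with identical items, in SD units.
retest_gain_sd = 0.30
# Median cross-sectional age slope for the same variables, in SD per year.
cross_sectional_slope = -0.024

# Years of cross-sectional age difference equivalent to one retest gain.
equivalent_years = retest_gain_sd / abs(cross_sectional_slope)
print(round(equivalent_years, 1))   # 12.5
```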

The results of the current studies, and of several earlier studies (e.g., Beglinger, et al., 2005; Benedict, 2005; Benedict & Zgaljardic, 1998; Dikmen, et al., 1999; Hausknecht, et al., 2007; Woods, et al., 2006), indicate that for some variables the average retest influences can be minimized, or possibly even eliminated, by the use of alternate forms on successive occasions. However, it is important to note that this is not the case for all variables, because substantial retest gains were apparent in spatial visualization and perceptual speed tests even when the successive tests contained different items.

As noted in the introduction, the magnitude of stability coefficients can be used as an indirect reflection of the amount of between-person variability in change. However, because the test-retest correlations are not 1.0 at intervals ranging from one day to a few weeks, correlations with very short-term retest intervals need to be considered when interpreting test-retest correlations with longer intervals. To illustrate, the short-term stability coefficient for the Matrix Reasoning variable in these studies was about .8, and thus the corresponding value in a conventional longitudinal study would have to be appreciably lower than this to justify a conclusion that people differed in their rates of age-related change on this variable.

The across-session correlations between latent constructs formed from three or more variables at each occasion were very close to 1.0. Stability coefficients for latent constructs in conventional longitudinal studies are also often quite high, but there is seldom any information about the values of the correlations with very short retest intervals. For example, Schaie (2005, Table 8.10) reported correlations across a 7-year interval of .8 or greater for several factor scores, but there was no mention of the correlations across very short intervals that would allow these values to be interpreted as reflections of the magnitude of individual differences in maturational influences.

Most of the estimates of the reliability of the changes were fairly low, which limits the relations the measures of change can have with other variables. However, it is noteworthy that there was considerable variation in the reliabilities of the change measures across different cognitive variables. As an example, the estimated reliabilities of the measures of change in the Word Recall variable were in the .6 to .8 range, but the estimated reliabilities of the changes in other variables, such as Matrix Reasoning and Paper Folding, were very low. In a conventional longitudinal study, reliability differences such as these could lead to conclusions that some variable, such as physical exercise, cognitive stimulation, or personality type, has greater effects on the age-related changes in memory than on the age-related changes in reasoning, when the differential relations could simply reflect differential reliability of the measures of change. More reliable measures of change might be obtained by examining change among composite scores or latent constructs, which tend to have higher reliability at each occasion than the individual scores contributing to them, or by using latent difference score analyses (e.g., McArdle & Nesselroade, 1994). However, such approaches do not, by themselves, distinguish reliable retest-related change from reliable maturation-related change; for that purpose, more sophisticated modeling procedures (e.g., McArdle & Woodcock, 1997) should be considered.
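Why change scores can be much less reliable than the scores they are computed from can be seen in the classical formula for the reliability of a difference. The sketch below implements that formula; the occasion reliabilities of .85 are illustrative assumptions, and the stability coefficient of .80 is roughly the short-term value reported for Matrix Reasoning.

```python
def difference_score_reliability(r11, r22, r12, sd1=1.0, sd2=1.0):
    """Classical reliability of a T2 - T1 difference score.

    r11, r22 : reliabilities of the scores at each occasion
    r12      : test-retest (stability) correlation
    sd1, sd2 : standard deviations at each occasion
    """
    num = sd1**2 * r11 + sd2**2 * r22 - 2 * sd1 * sd2 * r12
    den = sd1**2 + sd2**2 - 2 * sd1 * sd2 * r12
    return num / den

# With occasion reliabilities of .85 and a stability coefficient of .80,
# the change score is only modestly reliable:
print(round(difference_score_reliability(.85, .85, .80), 2))  # 0.25
```

The formula makes the tradeoff explicit: the higher the test-retest correlation relative to the occasion reliabilities, the less reliable the difference score, which is why variables with high stability can yield nearly unusable change measures.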

The discovery of weak structure among the measures of change should not be surprising in light of the low reliabilities. The small correlations among the changes are inconsistent with the idea that different variables, even those representing the same type of cognitive ability, change together across short intervals. Stronger evidence for correlated change might be found in a conventional longitudinal study with longer retest intervals, but it would still be informative to compare the correlations with those from a short-term retest study to distinguish the contribution of correlated retest effects from correlated maturation effects (cf., Ferrer, Salthouse, McArdle, Stewart & Schwartz, 2005).

Another noteworthy finding in the current project is that the direction of the relations of the change measures with other variables depends on how change is assessed. The results in Table 4 reveal that completely opposite conclusions could be reached about the influence of cognitive ability or of age on short-term changes according to whether change was evaluated with difference scores or with residuals. These patterns are likely due to the fact that some of the relations apparent with difference scores reflect relations with the original scores, whereas influences of the original scores are statistically removed with residuals. That is, if age is negatively related to the T1 score then it will tend to be positively related to a difference created by subtraction of the T1 score from the T2 score. Residual measures of change may therefore be more meaningful if one is interested in relations of change measures that are independent of relations among the initial scores.
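The sign reversal described above can be reproduced in a toy simulation. The data below are entirely hypothetical: age is built to relate negatively to the time-1 score, and the time-2 score regresses toward the mean (a slope below 1 on T1, as occurs when T1 contains measurement error). Under those assumptions, the difference score inherits a positive age relation from the term -T1, while the regression residual does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical data: age relates negatively to the T1 score, and the
# T2 score regresses toward the mean (slope < 1 on T1).
age = rng.uniform(20, 90, n)
t1 = 60 - 0.3 * age + rng.normal(0, 8, n)
t2 = 20 + 0.6 * t1 + rng.normal(0, 5, n)

# Difference score: T2 - T1. It contains -T1, so it inherits a
# positive age relation from the negative age-T1 correlation.
diff = t2 - t1
r_diff = np.corrcoef(age, diff)[0, 1]

# Residual change: regress T2 on T1 and keep the residuals, which
# removes the influence of the initial score by construction.
slope, intercept = np.polyfit(t1, t2, 1)
resid = t2 - (slope * t1 + intercept)
r_resid = np.corrcoef(age, resid)[0, 1]

print(f"age-difference correlation: {r_diff:+.2f}")   # clearly positive
print(f"age-residual correlation:   {r_resid:+.2f}")  # near zero
```

The simulation only demonstrates the mechanism in the text (difference scores carrying relations of the initial score); it does not model the additional maturational or retest influences that produced the negative residual-age relations in the actual data.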

Many of the residual change measures had negative relations with age, and positive relations with a measure of general cognitive ability. In a conventional longitudinal study correlations such as these might be interpreted as reflecting influences on rates of aging. For example, the negative age correlations might be interpreted as reflecting more rapid decline at older ages, but because the same pattern is apparent with a very short interval between successive tests, the results could actually reflect smaller benefits of prior testing experience with increased age. Furthermore, the finding of a larger increase (or smaller decline) among individuals with higher levels of general cognitive ability is consistent with the pattern sometimes interpreted as evidence for the notion of cognitive reserve (Stern, 2003), but the results cannot reflect effects on the rate of aging when, as in these studies, the interval between measurement occasions is in the range of days instead of years. The positive relation between initial level of cognitive ability and the magnitude of the retest gain is also consistent with the “rich get richer” suggestion of Rapport et al. (1997), but is inconsistent with recent results by Coyle (2006).

Lower mean levels of performance might be expected on the second assessment when the intervals between tests are longer because of maturation-related declines in ability combined with decay of the retest gains over time. Moreover, if people age at different rates, one might expect relatively low test-retest correlations (i.e., less stability), moderately high reliability of the measures of change, and possibly larger correlations of the measures of change with one another and with other variables. However, the current results with very short-term retest intervals indicate that the values are not 0 (for reliabilities and intercorrelations) or 1.0 (for test-retest correlations) when no maturational influences are operating, and thus the absolute magnitudes of these parameters can only be meaningfully interpreted by considering the corresponding values with very short retest intervals.

The major implication of the current analyses for neuropsychological research is that the mere observation of change does not mean that neurodegenerative processes related to disease, pathology, trauma, or aging are being evaluated (or at least not solely). Retest effects were found not only to influence mean levels of performance, but also to differentially affect individuals of different ages and ability levels. Conventional analyses rely on predictors of longitudinal change to make inferences about risk or protective factors associated with cognitive/neuropsychological deficits, but the current results suggest that some of these relations may be attributable to individual differences in the magnitude of retest effects. One can have greater confidence that such patterns reflect only the processes of interest when the patterns at long retest intervals (or in patient groups) differ substantially from those at very short retest intervals (or in healthy control groups). Although it will likely add to the time and expense of the research, including such “control” observations could greatly increase the interpretability of longitudinal research.


This research was supported by National Institute on Aging Grant R37AG024270 to TAS.


Description of reference variables and sources of tasks

Matrix Reasoning: Determine which pattern best completes the missing cell in a matrix (Raven, 1962)
Shipley Abstraction: Determine the words or numbers that are the best continuation of a sequence (Zachary, 1986)
Letter Sets: Identify which of five groups of letters is different from the others (Ekstrom et al., 1976)
Spatial Relations: Determine the correspondence between a 3-D figure and alternative 2-D figures (Bennett et al., 1997)
Paper Folding: Determine the pattern of holes that would result from a sequence of folds and a punch through folded paper (Ekstrom et al., 1976)
Form Boards: Determine which combinations of shapes are needed to fill a larger shape (Ekstrom et al., 1976)
Logical Memory: Number of idea units recalled across three stories (Wechsler, 1997b)
Free Recall: Number of words recalled across trials 1 to 4 of a word list (Wechsler, 1997b)
Paired Associates: Number of response terms recalled when presented with a stimulus term (Salthouse et al., 1996)
Digit Symbol: Use a code table to write the correct symbol below each digit (Wechsler, 1997a)
Letter Comparison: Same/different comparison of pairs of letter strings (Salthouse & Babcock, 1991)
Pattern Comparison: Same/different comparison of pairs of line patterns (Salthouse & Babcock, 1991)
WAIS Vocabulary: Provide definitions of words (Wechsler, 1997a)
Picture Vocabulary: Name the pictured object (Woodcock & Johnson, 1990)
Antonym Vocabulary: Select the best antonym of the target word (Salthouse, 1993)
Synonym Vocabulary: Select the best synonym of the target word (Salthouse, 1993)


1Because of the moderately large sample size and the relatively large number of statistical comparisons, a significance level of .01 was used for all statistical comparisons.

2It is important to note that because the standard deviations used to scale the retest effects include variation associated with age, the reported retest gains are likely underestimates of what would be obtained in an age-homogeneous sample. That is, because retest effects correspond to the performance differences across the two sessions divided by the first session standard deviation, the effects in an age-restricted sample will be larger by an amount proportional to the ratio of the age-heterogeneous and age-homogeneous standard deviations. The median ratios of the standard deviations of the original scores and of the residuals after partialling the relations of age were computed in Study 1. The medians were 1.03 for the vocabulary variables, and 1.13, 1.15, 1.09, and 1.22, respectively, for the reasoning, spatial visualization, memory, and speed variables. It can therefore be inferred that the size of the retest estimates would likely be 10% to 20% larger in a sample with little or no age variation.
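The scaling argument in this footnote can be made concrete with a small arithmetic sketch. The raw gain and standard deviation below are illustrative assumptions; only the ratio of 1.15 (the reported median for the spatial visualization variables) comes from the text.

```python
# Sketch of the footnote's scaling argument with illustrative numbers.
mean_gain = 2.0          # hypothetical T2 - T1 gain in raw score units
sd_heterogeneous = 10.0  # hypothetical SD in the age-heterogeneous sample
sd_ratio = 1.15          # reported median ratio for spatial visualization

# Retest effect = mean gain divided by the first-session SD, so removing
# age variance (shrinking the SD by the ratio) inflates the effect by
# exactly that ratio.
d_heterogeneous = mean_gain / sd_heterogeneous
d_homogeneous = mean_gain / (sd_heterogeneous / sd_ratio)

print(round(d_heterogeneous, 2))  # 0.2
print(round(d_homogeneous, 2))    # 0.23, i.e. 15% larger
```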

3Because the 1st PC is based on the initial values of all of the variables, it is not necessarily independent of the change score for a particular variable. The analyses for each cognitive variable were therefore repeated after deleting that variable from the principal components analyses. Perhaps because each variable was only one of 16 variables contributing to the 1st PC, the results of these analyses were nearly identical to those reported in Table 4.


  • Arbuckle JL. AMOS 7.0 User’s Guide. SPSS, Inc; Chicago, IL: 2007.
  • Basso MR, Bornstein RA, Lang JM. Practice effects on commonly used measures of executive function across twelve months. The Clinical Neuropsychologist. 1999;10:283–292. [PubMed]
  • Beglinger LJ, Gaydos B, Tangphao-Daniels O, Duff K, Kareken DA, Crawford J, Fastenau PS, Siemers ER. Practice effects and the use of alternate forms in serial neuropsychological testing. Archives of Clinical Neuropsychology. 2005;20:517–529. [PubMed]
  • Benedict RH. Effects of using same- versus alternate-form memory tests during short-interval repeated assessments in multiple sclerosis. Journal of the International Neuropsychological Society. 2005;11:727–736. [PubMed]
  • Benedict RHB, Zgaljardic DL. Practice effects during repeated administrations of memory tests with and without alternate forms. Journal of Clinical and Experimental Neuropsychology. 1998;20:339–352. [PubMed]
  • Bennett GK, Seashore HG, Wesman AG. Differential Aptitude Test. San Antonio, TX: The Psychological Corporation; 1997.
  • Burke EF. A short note on the persistence of retest effects on aptitude scores. Journal of Occupational and Organizational Psychology. 1997;70:295–301.
  • Carroll JB. Human cognitive abilities: A survey of factor-analytic studies. NY: Cambridge University Press; 1993.
  • Chelune GJ, Naugle RI, Luders H, Sedlak J, Awad IA. Individual change after epilepsy surgery: Practice effects and base-rate information. Neuropsychology. 1993;7:41–52.
  • Cohen J, Cohen P, West SG, Aiken LS. Applied multiple regression/correlation analysis for the behavioral sciences. 3. Mahwah, NJ: Lawrence Erlbaum Associates; 2003.
  • Collie A, Maruff P, Darby DG, McStephen M. The effects of practice on the cognitive test performance of neurologically normal individuals assessed at brief test-retest intervals. Journal of the International Neuropsychological Society. 2003;9:419–428. [PubMed]
  • Coyle TR. Test-retest changes on scholastic aptitude tests are not related to g. Intelligence. 2006;34:15–27.
  • DeMonte VE, Geffen GM, Kwapil K. Test-retest reliability and practice effects of a rapid screen of mild traumatic brain injury. Journal of Experimental and Clinical Neuropsychology. 2005;27:624–632. [PubMed]
  • Dikmen SS, Heaton RK, Grant I, Temkin NR. Test-retest reliability and practice effects of expanded Halstead-Reitan Neuropsychological Test Battery. Journal of the International Neuropsychological Society. 1999;5:346–356. [PubMed]
  • Duff K, Beglinger LJ, Schoenberg MR, Patton DE, Mold J, Scott JG, Adams RL. Test-retest stability and practice effects of the RBANS in a community dwelling elderly sample. Journal of Clinical and Experimental Neuropsychology. 2005;27:565–575. [PubMed]
  • Ekstrom RB, French JW, Harman HH, Dermen D. Manual for kit of factor-referenced cognitive tests. Princeton, NJ: Educational Testing Service; 1976.
  • Falleti MG, Maruff P, Collie A, Darby DG. Practice effects associated with the repeated assessment of cognitive function using the CogState Battery at 10-minute, one week and one month test-retest intervals. Journal of Clinical and Experimental Neuropsychology. 2006;28:1095–1112. [PubMed]
  • Ferrer E, Salthouse TA, McArdle JJ, Stewart WF, Schwartz BS. Multivariate modeling of age and retest in longitudinal studies of cognitive abilities. Psychology and Aging. 2005;20:412–422. [PMC free article] [PubMed]
  • Frerichs RJ, Tuokko HA. A comparison of methods for measuring cognitive change in older adults. Archives of Clinical Neuropsychology. 2005;20:321–333. [PubMed]
  • Folstein MF, Folstein SE, McHugh PR. Mini-mental state: A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research. 1975;12:189–198. [PubMed]
  • Hausknecht JP, Halpert JA, Di Paolo NT, Gerrard MOM. Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology. 2007;92:373–385. [PubMed]
  • Hausknecht JP, Trevor CO, Farr JL. Retaking ability tests in a selection setting: Implications for practice effects, training performance, and turnover. Journal of Applied Psychology. 2002;87:243–254. [PubMed]
  • Ivnik RJ, Smith GE, Lucas JA, Petersen RC, Boeve BF, Kokmen E, Tangalos EG. Testing normal older people three or four times at 1- to 2-year intervals: Defining normal variance. Neuropsychology. 1999;13:121–127. [PubMed]
  • Jensen A. The g factor: The science of mental ability. Westport, CT: Praeger; 1998.
  • Knight RG, McMahon J, Skeaff CM, Green TJ. Reliable change index scores for persons over the age of 65 tested on alternate forms of the Rey AVLT. Archives of Clinical Neuropsychology. 2007;22:513–518. [PubMed]
  • Lemay S, Bedard MA, Roulea I, Tremblay PLG. Practice effect and test-retest reliability of attentional and executive tests in middle-aged to elderly subjects. The Clinical Neuropsychologist. 2004;18:284–302. [PubMed]
  • Levine AJ, Miller EN, Becker JT, Selnes OA, Cohen BA. Normative data for determining significance of test-retest differences on eight common neuropsychological instruments. The Clinical Neuropsychologist. 2004;18:373–384. [PubMed]
  • Lowe C, Rabbitt P. Test/re-test reliability of the CANTAB and ISPOCD neuropsychological batteries: theoretical and practical issues. Neuropsychologia. 1998;36:915–923. [PubMed]
  • McArdle JJ, Nesselroade JR. Using multivariate data to structure developmental change. In: Coren SH, Reese HW, editors. Lifespan Developmental Psychology: Methodological Contributions. Hillsdale, N.J: Erlbaum; 1994. pp. 223–267.
  • McArdle JJ, Woodcock JR. Expanding test-retest designs to include developmental time-lag components. Psychological Methods. 1997;2:403–435.
  • Rapport LJ, Brines DB, Axelrod BN, Theisen ME. Full scale IQ as mediator of practice effects: The rich get richer. The Clinical Neuropsychologist. 1997;11:375–380.
  • Raven J. Advanced Progressive Matrices, Set II. London: H.K. Lewis; 1962.
  • Reeve CL, Lam H. The psychometric paradox of practice effects due to retesting: measurement invariance and stable ability estimates in the face of observed score changes. Intelligence. 2005;33:535–549.
  • Salinsky MC, Storzbach D, Dodrill CB, Binder LM. Test-retest bias, reliability, and regression equations for neuropsychological measures repeated over a 12–16-week period. Journal of International Neuropsychological Society. 2001;7:597–605. [PubMed]
  • Salthouse TA. Speed and knowledge as determinants of adult age differences in verbal tasks. Journal of Gerontology: Psychological Sciences. 1993;48:P29–P36. [PubMed]
  • Salthouse TA. Localizing age-related individual differences in a hierarchical structure. Intelligence. 2004;32:541–561. [PMC free article] [PubMed]
  • Salthouse TA. Relations between cognitive abilities and measures of executive functioning. Neuropsychology. 2005;19:532–545. [PubMed]
  • Salthouse TA. Implications of within-person variability in cognitive and neuropsychological functioning on the interpretation of change. Neuropsychology. 2007;21:401–411. [PMC free article] [PubMed]
  • Salthouse TA, Atkinson TM, Berish DE. Executive functioning as a potential mediator of age-related cognitive decline in normal adults. Journal of Experimental Psychology: General. 2003;132:566–594. [PubMed]
  • Salthouse TA, Babcock RL. Decomposing adult age differences in working memory. Developmental Psychology. 1991;27:763–776.
  • Salthouse TA, Berish DE, Siedlecki KL. Construct validity and age sensitivity of prospective memory. Memory & Cognition. 2004;32:1133–1148. [PubMed]
  • Salthouse TA, Ferrer-Caja E. What needs to be explained to account for age-related effects on multiple cognitive variables? Psychology and Aging. 2003;18:91–110. [PubMed]
  • Salthouse TA, Fristoe N, Rhee SH. How localized are age-related effects on neuropsychological measures? Neuropsychology. 1996;10:272–285.
  • Salthouse TA, Pink JE, Tucker-Drob EM. Contextual analysis of fluid intelligence. Intelligence, in press.
  • Salthouse TA, Schroeder DH, Ferrer E. Estimating retest effects in longitudinal assessments of cognitive functioning in adults between 18 and 60 years of age. Developmental Psychology. 2004;40:813–822. [PubMed]
  • Salthouse TA, Siedlecki KL, Krueger LE. An individual differences analysis of memory control. Journal of Memory and Language. 2006;55:102–125. [PMC free article] [PubMed]
  • Schaie KW. Developmental Influences on Adult Intelligence: The Seattle Longitudinal Study. New York: Oxford University Press; 2005.
  • Stern Y. The concept of cognitive reserve: a catalyst for research. Journal of Clinical and Experimental Neuropsychology. 2003;25:589–593. [PubMed]
  • Theisen ME, Rapport LJ, Axelrod BN, Brines DB. Effects of practice in repeated administrations of the Wechsler Memory Scale-Revised in normal adults. Psychological Assessment. 1998;5:85–92. [PubMed]
  • Wechsler D. Wechsler Adult Intelligence Scale. 3. San Antonio, TX: The Psychological Corporation; 1997a.
  • Wechsler D. Wechsler Memory Scale. 3. San Antonio, TX: The Psychological Corporation; 1997b.
  • Wilson BA, Watson PC, Baddeley AD, Emslie H, Evans JJ. Improvement or simply practice? The effects of twenty repeated assessments on people with and without brain injury. Journal of the International Neuropsychological Society. 2000;6:469–479. [PubMed]
  • Woodcock RW, Johnson MB. Woodcock-Johnson Psycho-Educational Battery – Revised. Allen, TX: DLM; 1990.
  • Woods SP, Delis DC, Scott JC, Kramer JH, Holdnack JA. The California Verbal Learning Test – Second Edition. Test-retest reliability, practice effects, and reliable change indices for the standard and alternate forms. Archives of Clinical Neuropsychology. 2006;21:413–420. [PubMed]
  • Zachary RA. Shipley Institute of Living Scale – Revised. Los Angeles, CA: Western Psychological Services; 1986.