Perspective on chemicals and health.

The ECOG Scale of Performance Status (PS) is widely used to quantify the functional status of cancer patients, and is an important factor determining prognosis in a number of malignant conditions. The PS describes the status of symptoms and functions with respect to ambulatory status and need for care. PS 0 means normal activity, PS 1 means some symptoms, but still near fully ambulatory, PS 2 means less than 50%, and PS 3 means more than 50% of daytime in bed, while PS 4 means completely bedridden. An inter-observer variability study of PS assessment has been carried out to evaluate the non-chance agreement among three oncologists rating 100 consecutive cancer patients. Total unanimity was observed in 40 cases, unanimity between two observers in 53 cases, and total disagreement in seven cases. Kappa statistics reveal the ability of the observers compared to change alone and were used to evaluate non-chance agreement. Overall Kappa was

Performance status (PS) is an assessment of the patients' actual level of function and capability of self-care. It has repeatedly been demonstrated that PS is an important prognostic factor for survival in several major cancer forms, e.g. breast cancer (Swenerton et al., 1979), ovarian cancer (Lund et al., 1990), small cell lung cancer (0sterlind & Andersen, 1986), and non-small cell lung cancer (S0rensen et al., 1989). Accordingly, PS must be taken into consideration in the planning and evaluation of clinical trials of cancer treatment. It has also been suggested that PS might be used as part of the assessment of the patients' quality of life (Ganz et al., 1988).
Several scales for measuring PS have been suggested, among which the most widely used are Karnofsky's Scale of Performance Status (Karnofsky et al., 1948), and ECOG Scale of Performance Status (Zubrod et al., 1960). In spite of their common use there is only limited information about the validity and reliability of these scales.
The validity relates to whether the scale actually measures the intended subject, while reliability deals with the degree of confidence we have in the individual measurements, and is often described as intraand inter-observer variability.
Only few previous trials have evaluated the validity of Karnofsky Performance Status Scale (Mor et al., 1984;Wood et al., 1981;Schag et al., 1984), and studies on the ECOG Scale are even more sparse.
No intra-observer variability analysis has been reported for any of the PS scales. In contrast, three previous papers report on the inter-observer variability in the use of Karnofsky Performance Status Scale (Schag et al., 1984;Yates et al., 1980;Hutchinson et al., 1979). No (Conill et al., 1990). Yates et al. (Yates et al., 1980) evaluated the Karnofsky Scale with respect to inter-observer variability between nurses and social workers and found a correlation coefficient of 0.69. Poor correlation among PS assessment by doctors was reported by Hutchinson et al. (Hutchinson et al., 1979), who Correspondence: J. Benn S0rensen. Received 21 April 1992; and in revised form 12 October 1992. included patients requiring hemodialysis or patients admitted to the emergency room. This might influence the results, as the Karnofsky Scale was originally designed for use in cancer patients.
Agreement between oncologists on the one hand and psychologists or psychiatrists on the other was evaluated by Schag et al. (Schag et al., 1984) in 75 cancer patients. They found a correlation coefficient of 0.89. The question of agreement among oncologists was not evaluated. This issue has been addressed in one study (Conill et al., 1990), though not with the use of Kappa statistics for the evaluation of nonchance agreement. The purpose of the present study was to evaluate the reliability of the ECOG Scale of Performance Status by measuring the non-chance agreement between three oncologists.

Materials and methods
The three observers were oncologists working in the clinic who otherwise had no specific training for the actual project.
The patient population was 100 consecutive in-patients at the clinic seen on randomly selected days during the 3-month study period. The patients were included, after informed conset had been obtained. Each observer interviewed the patient on the same day, usually within a 3-h period, and scored the patient according to the ECOG Scale of Performance Status (Table I). Each observer was blinded for the rating of the other observers and did not see the hospital records before rating. Symptoms, but nearly fully ambulatory 2 Some bed time, but needs to be in bed less than 50% of normal daytime 3 Needs to be in bed greater than 50% of normal daytime 4 Unable to get out of bed The inter-observer agreement was evaluated using Kappa statistics. Kappa is a coefficient of interjudge agreement for nominal and ordinal scales (Cohen, J, 1960). It is directly interpretable as the proportion of joint judgements in which there is agreement, after chance agreement is excluded. Thus, Kappa value indicates how much better the observers are as compared to chance alone, and varies between -1 and + 1. Kappa = + 1 means full agreement, Kappa = 0 indicates that the agreement can be explained solely by chance, and Kappa < 0 is found when the observed agreement is less than expected by chance (Cohen, J, 1960;Fleiss et al., 1979). Kappa values above 0.40 indicate fair agreement, while values above 0.70 point towards good agreement.

Results
Characteristics of the 100 consecutive patients who entered the study are given in Table II. There was an equal sex distribution in the study. Approximately equal numbers of patients had small cell lung cancer, ovarian cancer, testicular cancer, and other malignant diseases. The distribution of ECOG scores for the individual observers is shown in Table III. The number of patients considered to have ECOG score 0 and 1 was similar among the observers. Larger differences were, however, noted with respect to ECOG scores 2, 3, and 4 (Table III). Unanimity between all three observers was observed in 40 cases, unanimity between two of the observers in 53 cases, and total disagreement in seven cases.
Kappa statistics on the inter-observer variability are shown in Table IV. Overall Kappa in ECOG scoring between the three observers was 0.44 (95% confidence limits 0.38-0.51). The Kappa value was higher in the best performance status groups, ECOG score 0 having a Kappa value of 0.55 (95% confidence limits 0.44-0.67), and lowest in ECOG score 4, with a Kappa value of 0.33 (95% confidence limits 0.22-0.45). This points towards a lower agreement among observers in evaluation of patients in the poor performance status groups than the less affected patients.
The Kappa values given above were not weighted according to the degree of disagreement, i.e. the more serious disagreements when observers rank the same patient as hav-  ing PS 0 and PS 4, respectively, were given the same weight in the analysis as the less serious event when a patient was ranked as PS 0 and PS 1, respectively.
However, given the considerable variations in ECOG scoring between the observers, we analysed the results further by dividing the scores into only two groups: ECOG score 0-2 versus ECOG score 3-4 (Table V). This dividing point was chosen because it is often used as a cut-off point for inclusion of patients in clinical studies of experimental treatments.
If one observer allocated a patient to ECOG score 0, 1, or 2, then another randomly selected observer would place the patient in the same category (ECOG 0, 1, or 2) with a probability of 0.92. Thus the proportion of second observers in the study whose assessment agreed with that of the first observer was 0.92. For patients in ECOG group 3 to 4, the same category (ECOG 3 or 4) was chosen with a probability of 0.82.

Discussion
In the interpretation of data on agreement rates, the original distribution of the phenomenon in the study with respect to frequency of normality and abnormality is crucial, and must be taken into consideration. This hampers comparison between studies, since an overall agreement rate is the sum of agreements about normality and abnormality. An overall agreement rate of 80% may mean 40% of cases agreed as normal and 40% agreed as abnormal, or 75% of cases agreed as normal and 5% agreed as abnormal. The latter outcome is much easier to achieve (Koran, 1975). By use of Kappa statistics, it is possible to calculate the non-chance agreement, and this standardised measure may be compared between studies.
Also, in the inter-observer variability within a study, the agreement rate of two equally skilled physicians regarding the presence of an abnormality in a series of cases is, in part, a function of the proportion of cases each physician consideres abnormal (Spitzer & Fleiss, 1974). If two physicians each consider half the cases abnormal, they will agree 25% of the time by chance alone. If they each consider 80% of the cases abnormal, they will agree 64% of the time by chance alone. Hence, since the proportion of abnormal cases varies across studies, the level of agreement by chance varies across studies. Thus, an overall agreement rate of 80% in one study may be 55% more than chance expectation, whereas in a second study it is only 16% more than chance expectation (Spitzer & Fleiss, 1974).
This phenomenon is reflected in the current study, in which only hospitalised patients were examined, with more than 40% of the patients having ECOG score 2 or worse, and 21% to 31% of cases having ECOG score 3 or 4. This resulted in an overall agreement of 40% among three observers. In contrast, Conill et al. (Conill et al., 1990) examined a patient population with a lower frequency of severe abnormalities, as 79% to 81% of their patients had ECOG Performance Status 0 or 1, and only 5% to 7% had PS 3 or 4. There was a coincidence degree of 59% in the rating according to ECOG Performance Status by the two observers. Contributing to the difference in agreement rates between the present study (40% agreement) and the study by Conill et al. (59%) is also the different number of observers, as the more observers, the lower the inter-observer agreement rate would be expected to be (Segall, 1960;Conn & Spencer, 1972). The order in which the observers interviewed patients for assessment of performance status may theoretically introduce systematic biases if the order is not completely randomized. In the current study, observer 1 had universally performed the first interview with patient, while the order of interview by observer 2 and 3 was random. However, the similar distribution of patients in the respective performance status classes achieved by the three observers as shown in Table III does not point towards any systematic bias in the results.
Performance status has in recent years become a tool in the growing field of psychometric testing of scales for measuring quality of life among cancer patients. In the pro-cess of developing such scales the validity is often tested by comparing the results from rating on the quality of life scale to the PS assessment, e.g. as described by Schipper et al. (Schipper et al., 1984) in the evaluation of the Functional Living Index -Cancer. This new use of the PS assessment obviously necessitates a solid knowledge of the validity and the reliability of the PS scale itself.
While PS assessment has traditionally been performed by physicians, this is not necessarily the case in scales measuring quality of life. Slevin et al. conclude from a questionnaire study that a reliable and consistent method of measuring quality of life in cancer patients must come from the patients themselves (Slevin et al., 1988). Given this difference with respect to observers of the scale, together with the interobservers variability in PS assessment, one should not expect correlations between quality of life scales and PS to be very high, even if the former scales were valid.
In conclusion, the reliability of the ECOG Scale of Performance Status, as evaluated by the degree of inter-observer variability, is fair, offering additional support for its widespread use, for example as a prognostic factor or as an inclusion criterium for entry into clinical trials. Agreement is higher in patients with good performance status than in more affected patients, a fact which must be born in mind in the interpretation of data from trials dealing with this topic.