Kappa statistics to measure interrater and intrarater agreement for 1790 cervical biopsy specimens among twelve pathologists: qualitative histopathologic analysis and methodologic issues

Gynecol Oncol. 2005 Dec;99(3 Suppl 1):S38-52. doi: 10.1016/j.ygyno.2005.07.040. Epub 2005 Sep 23.

Abstract

Background: As part of a program project to evaluate emerging optical technologies for cervical neoplasia, we performed fluorescence and reflectance spectroscopic examinations of patients with abnormal Papanicolaou smears. Biopsy specimens were taken from each measured area, and study pathologists performed qualitative histopathologic readings. Several methodologic questions were addressed in this analysis: (1) the interpathologist and intrapathologist agreement within institutions for the 1790 biopsy specimens; (2) the interinstitutional agreement between the two institutions conducting the trials on 117 randomly chosen biopsy specimens; (3) the agreement among the two institutions and a third, expert gynecologic pathologist, to ensure that the study readings were comparable to readings from outside both institutions, on the same 117 randomly chosen biopsy specimens; and (4) an additional three reviews of the 106 difficult biopsy specimens by all three institutions.

Methods: All 1790 specimens from 850 patients were reviewed up to three times at each institution in a blinded fashion; specimens for which the first and second reviews were identical were not reviewed a third time. A randomly selected sample of 117 specimens was randomly ordered and read by study pathologists at The University of Texas M. D. Anderson Cancer Center, the British Columbia Cancer Agency (BCCA), and Brigham and Women's Hospital (BWH). The 106 difficult cases were treated in the same manner as the randomly selected, randomly ordered cases. Generalized, unweighted, and weighted kappas and their 95% confidence intervals were used to assess agreement. Binary comparisons were used to compare diagnostic categories.
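The unweighted and weighted kappas used above can be sketched as follows. This is a minimal illustration of Cohen's kappa for two raters from a k-by-k confusion matrix, using the disagreement-weight formulation (0/1 weights reproduce the unweighted kappa; linear distance weights give a linearly weighted kappa). The matrix is a toy example, not study data, and the function name is our own.

```python
# Sketch of unweighted and linearly weighted Cohen's kappa for two raters,
# computed from a k x k confusion matrix (rows: rater A, columns: rater B).
# The confusion matrix below is a toy example, not data from this study.

def cohen_kappa(matrix, weighted=False):
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_tot = [sum(matrix[i]) for i in range(k)]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Disagreement weights: 0 on the diagonal; 1 off-diagonal (unweighted),
    # or proportional to category distance (linearly weighted).
    w = [[abs(i - j) / (k - 1) if weighted else (0 if i == j else 1)
          for j in range(k)] for i in range(k)]
    # Observed and chance-expected weighted disagreement.
    obs = sum(w[i][j] * matrix[i][j] / n for i in range(k) for j in range(k))
    exp = sum(w[i][j] * row_tot[i] * col_tot[j] / n**2
              for i in range(k) for j in range(k))
    return 1 - obs / exp

toy = [[20, 5, 0],
       [4, 15, 3],
       [1, 2, 10]]
print(round(cohen_kappa(toy), 3))                 # unweighted -> 0.612
print(round(cohen_kappa(toy, weighted=True), 3))  # linearly weighted -> 0.677
```

Weighting matters here because the diagnostic scale is ordered: a CIN2-versus-CIN3 disagreement is penalized less than a normal-versus-CIN3 disagreement, which is why the weighted kappas in the Findings can exceed the unweighted ones.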

Findings: The kappas for the three readings of the overall data set using the eight-category World Health Organization (WHO) criteria were as follows: 0.66 for generalized, 0.72 for weighted, and 0.59-0.94 for unweighted binary categories; for those read using the four-category Bethesda criteria: 0.70 for generalized, 0.69 for weighted, and 0.56-0.94 for unweighted binary categories. For the pool versus the study pathologist readings, the eight-category kappas were 0.51 for generalized, 0.72 for weighted, and 0.56-0.82 for unweighted binary categories; for those read using the Bethesda criteria: 0.70 for generalized, 0.70 for weighted, and 0.59-0.82 for unweighted binary categories. The interpathologist and intrapathologist readings were fair by Landis standards at the low end of the diagnostic scale (atypia, human papillomavirus, and CIN1) and substantial to almost perfect at the high end (CIN2, CIN3, and CIS). The randomly selected and randomly ordered sample of 117 specimens read with the WHO system yielded a generalized kappa of 0.45; among the three institution pairs (M. D. Anderson Cancer Center vs. BCCA, M. D. Anderson vs. BWH, and BCCA vs. BWH), the unweighted kappas were 0.46, 0.41, and 0.49 and the weighted kappas were 0.65, 0.66, and 0.68, respectively; for the Bethesda system, the generalized kappa was 0.65, the unweighted kappas were 0.66, 0.65, and 0.47, and the weighted kappas were 0.74, 0.72, and 0.74. The difficult specimens read with the WHO system yielded a generalized kappa of 0.23; among the three institution pairs, the unweighted kappas were 0.20, 0.30, and 0.37, and the weighted kappas were 0.17, 0.34, and 0.31; for the Bethesda system, the generalized kappa was 0.25, the unweighted kappas were 0.21, 0.32, and 0.37, and the weighted kappas were 0.07, 0.21, and 0.37, respectively.

Interpretation: Kappas in this expert group of pathologists were in the moderate, substantial, and almost perfect ranges for the overall and randomized samples. The randomized sample was representative of the larger sample. The kappa of the specimens for which disagreements arose was, predictably, in the slight range. Our findings will aid both the correlations with optical measurements using fluorescence and reflectance spectroscopy and the quantitative histopathologic analysis of these study specimens.
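The Landis benchmarks invoked above are the conventional Landis and Koch (1977) categories for interpreting kappa. A minimal sketch of that mapping (the function name is our own):

```python
# Conventional Landis and Koch (1977) benchmarks for interpreting kappa,
# as referenced in the Interpretation above.

def landis_koch(kappa):
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch(0.66))  # overall generalized WHO kappa -> "substantial"
print(landis_koch(0.45))  # randomized-sample generalized WHO kappa -> "moderate"
print(landis_koch(0.07))  # a difficult-specimen weighted kappa -> "slight"
```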

Publication types

  • Comparative Study
  • Multicenter Study
  • Research Support, N.I.H., Extramural

MeSH terms

  • Biopsy
  • Cervix Uteri / pathology*
  • Female
  • Histological Techniques / methods
  • Histological Techniques / standards
  • Humans
  • Observer Variation
  • Reproducibility of Results
  • Spectrometry, Fluorescence / methods
  • Statistics as Topic / methods*
  • Uterine Cervical Dysplasia / classification
  • Uterine Cervical Dysplasia / diagnosis
  • Uterine Cervical Dysplasia / pathology*
  • Uterine Cervical Neoplasms / classification
  • Uterine Cervical Neoplasms / diagnosis
  • Uterine Cervical Neoplasms / pathology*