IEEE Trans Image Process. 2016 Jun;25(6):2557-2572. doi: 10.1109/TIP.2016.2544703. Epub 2016 Mar 21.

An Empirical Study Into Annotator Agreement, Ground Truth Estimation, and Algorithm Evaluation.


Although agreement between the annotators who mark feature locations within images has been studied in the past from a statistical viewpoint, little work has attempted to quantify the extent to which this phenomenon affects the evaluation of foreground-background segmentation algorithms. Many researchers utilize ground truth (GT) in experimentation and more often than not this GT is derived from one annotator's opinion. How does the difference in opinion affect an algorithm's evaluation? A methodology is applied to four image-processing problems to quantify the interannotator variance and to offer insight into the mechanisms behind agreement and the use of GT. It is found that when detecting linear structures, annotator agreement is very low. The agreement in a structure's position can be partially explained through basic image properties. Automatic segmentation algorithms are compared with annotator agreement and it is found that there is a clear relation between the two. Several GT estimation methods are used to infer a number of algorithm performances. It is found that the rank of a detector is highly dependent upon the method used to form the GT, and that although STAPLE and LSML appear to represent the mean of the performance measured using individual annotations, when there are few annotations, or there is a large variance in them, these estimates tend to degrade. Furthermore, one of the most commonly adopted combination methods, consensus voting, accentuates more obvious features, resulting in an overestimation of performance. It is concluded that in some data sets, it is not possible to confidently infer an algorithm ranking when evaluating upon one GT.
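The effect of consensus voting described in this abstract can be illustrated with a small sketch. It is not the authors' methodology: the annotations, the detector output, the majority threshold, and the use of the Dice coefficient as the performance measure are all assumptions made for illustration. Each annotation is a flattened binary mask (1 = foreground).

```python
def consensus_vote(annotations, min_votes=None):
    """Combine binary masks: a pixel is foreground if at least
    min_votes annotators marked it (default: strict majority)."""
    n = len(annotations)
    if min_votes is None:
        min_votes = n // 2 + 1
    return [1 if sum(column) >= min_votes else 0 for column in zip(*annotations)]


def dice(pred, gt):
    """Dice overlap between two binary masks (assumed performance measure)."""
    intersection = sum(p and g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2.0 * intersection / total if total else 1.0


# Hypothetical data: all three annotators mark the two "obvious" pixels,
# but each also marks one or two subtle pixels that the others miss.
annotations = [
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0],
]
# Hypothetical detector that only finds the obvious pixels.
detector = [1, 1, 0, 0, 0, 0, 0, 0]

consensus = consensus_vote(annotations)
per_annotator = [dice(detector, a) for a in annotations]
mean_individual = sum(per_annotator) / len(per_annotator)

print(consensus)                  # [1, 1, 0, 0, 0, 0, 0, 0]
print(dice(detector, consensus))  # 1.0
print(round(mean_individual, 3))  # 0.756
```

In this toy case the consensus mask keeps only the pixels everyone agrees on, so the detector scores perfectly against it while scoring lower against every individual annotation.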

Stud Hist Philos Sci. 2014 Jun;46:64-72.

Empirical progress and nomic truth approximation revisited.


In my From Instrumentalism to Constructive Realism (2000) I have shown how an instrumentalist account of empirical progress can be related to nomic truth approximation. However, it was assumed that a strong notion of nomic theories was needed for that analysis. In this paper it is shown, in terms of truth and falsity content, that the analysis already applies when, in line with scientific common sense, nomic theories are merely assumed to exclude certain conceptual possibilities as nomic possibilities.

[Indexed for MEDLINE]

Estimation of the prior distribution of ground truth in the STAPLE algorithm: an empirical Bayesian approach.

Computational Radiology Laboratory, Department of Radiology, Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, USA.


We present a new fusion algorithm for the segmentation and parcellation of magnetic resonance (MR) images of the brain. Our algorithm is a parametric empirical Bayesian extension of the STAPLE algorithm which uses the observations to accurately estimate the prior distribution of the hidden ground truth using an expectation maximization (EM) algorithm. We use the IBSR dataset for the evaluation of our fusion algorithm. We segment 128 principal gray and white matter structures of the brain using our novel method and eight other state-of-the-art algorithms in the literature. Our prior distribution estimation strategy improves the accuracy of the fusion algorithm. It was shown that our new fusion algorithm has superior performance compared to the other state-of-the-art fusion methods in the literature.
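For orientation, the general idea of estimating a hidden binary ground truth with EM can be sketched as follows. This is a minimal binary STAPLE-style loop, not the algorithm of this paper: the function name, the initial values of 0.9, and in particular the way the prior is re-estimated (a single global prior set to the mean posterior after each iteration) are simplifying assumptions chosen only to show where a data-driven prior would enter the iteration.

```python
def staple_binary(decisions, n_iter=50, update_prior=True):
    """Toy binary STAPLE-style EM.

    decisions[j][i] is rater j's 0/1 label for voxel i.
    Returns (w, p, q, prior): posterior foreground probability per voxel,
    per-rater sensitivity p and specificity q, and the global prior.
    """
    n_raters = len(decisions)
    n_voxels = len(decisions[0])
    p = [0.9] * n_raters  # assumed initial sensitivities
    q = [0.9] * n_raters  # assumed initial specificities
    # Start from the overall fraction of foreground labels.
    prior = sum(sum(row) for row in decisions) / (n_raters * n_voxels)
    w = [prior] * n_voxels

    for _ in range(n_iter):
        # E-step: posterior probability that each voxel is foreground.
        for i in range(n_voxels):
            a = prior        # joint likelihood if the truth is 1
            b = 1.0 - prior  # joint likelihood if the truth is 0
            for j in range(n_raters):
                if decisions[j][i] == 1:
                    a *= p[j]
                    b *= 1.0 - q[j]
                else:
                    a *= 1.0 - p[j]
                    b *= q[j]
            w[i] = a / (a + b) if a + b > 0 else 0.5

        # M-step: re-estimate each rater's sensitivity and specificity.
        sum_w = max(sum(w), 1e-12)
        sum_not_w = max(n_voxels - sum(w), 1e-12)
        for j in range(n_raters):
            p[j] = sum(w[i] * decisions[j][i] for i in range(n_voxels)) / sum_w
            q[j] = sum((1.0 - w[i]) * (1 - decisions[j][i])
                       for i in range(n_voxels)) / sum_not_w

        # Assumption: update one global prior from the current posterior.
        if update_prior:
            prior = sum(w) / n_voxels

    return w, p, q, prior


# Hypothetical labels from three raters over eight voxels.
raters = [
    [1, 1, 1, 0, 0, 0, 0, 0],  # rater 0
    [1, 1, 1, 0, 0, 0, 0, 1],  # rater 1: one extra foreground voxel
    [1, 1, 0, 0, 0, 0, 0, 0],  # rater 2: misses one voxel
]
w, p, q, prior = staple_binary(raters)
fused = [1 if wi > 0.5 else 0 for wi in w]
print(fused, [round(x, 2) for x in p], round(prior, 2))
```

Setting `update_prior=False` keeps the prior fixed at its initial value, which makes it easy to compare a fixed prior against a re-estimated one on the same labels.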

[Indexed for MEDLINE]
Free PMC Article
Acad Radiol. 2001 Oct;8(10):947-54.

The "differential diagnosis" for multiple diseases: comparison with the binary-truth state experiment in two empirical studies.

Department of Biostatistics and Epidemiology, The Cleveland Clinic Foundation, OH 44195, USA.



In practice readers must often choose between multiple diagnoses. For assessing reader accuracy in these settings, Obuchowski et al have proposed the "differential diagnosis" method, which derives all pairwise estimates of accuracy for the various diagnoses, along with summary measures of accuracy. The current study assessed the correspondence between the differential diagnosis method and conventional binary-truth state experiments.


Two empirical studies were conducted at two institutions with different readers and diagnostic tests. Readers used the differential diagnosis format to interpret a set of cases. In subsequent readings they interpreted the cases in binary-truth state experiments. Spearman rank correlation coefficients and the percentages of agreement in scores were computed, and the areas under the receiver operating characteristic curves were estimated and compared.


The between-format Spearman rank correlation coefficients were 0.697-0.718 and 0.750-0.780 for the two studies; the between-reader correlations were 0.417 and 0.792, respectively. The percentages of agreement between formats for the two studies were 50.0%-51.7% and 72.9%-78.8%; the percentages of agreement between readers were 45.0% and 80%, respectively. In the first study there were several significant differences in the areas under receiver operating characteristic curves; in the second study these differences were small.


The differences observed between the two formats can be attributed to within-reader variability and inherent differences in the questions posed to readers in the multiple-diagnoses versus binary-truth state reading sessions. The differential diagnosis format is useful for estimating accuracy when there are multiple possible diagnoses.
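The two between-format statistics named in this abstract, the Spearman rank correlation coefficient and the percentage of agreement in scores, can be computed as in the sketch below. The scores are invented for illustration and are not data from the study; ties are handled with average ranks, which is a common convention and an assumption here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def percent_agreement(x, y):
    """Percentage of cases given exactly the same score in both readings."""
    return 100.0 * sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical confidence scores (1-5) for the same eight cases,
# read once in each format by the same reader.
differential_format = [5, 4, 4, 2, 1, 3, 5, 2]
binary_format = [5, 3, 4, 2, 1, 3, 4, 1]

print(round(spearman(differential_format, binary_format), 3))
print(percent_agreement(differential_format, binary_format))
```

The same two functions apply unchanged to between-reader comparisons by passing two readers' scores for one format instead of one reader's scores for two formats.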

[Indexed for MEDLINE]
J Clin Psychol. 2000 Dec;56(12):1615-21.

The DSM classification of personality disorder: clinical wisdom or empirical truth? A response to Alvin R. Mahrer's problem 11.

Department of Psychology, University of Virginia, Charlottesville 22904-4400, USA.


In a recent issue of the Journal of Clinical Psychology Alvin R. Mahrer (1999, pp. 1147-1156) outlined 11 problems facing the field of psychotherapy. Problem 11 states that psychotherapy rests on a foundation of truths that have not been tested in ways that could find them to be false. This point is especially pertinent to the DSM classification of personality disorders. None of the DSM-IV categories of personality disorder were discovered through empirical research. Rather, they were hypothesized by psychiatrists and psychologists, and put into writing. A review of the relevant literature reveals that empirical research findings rarely agree with the DSM conceptions of personality disorder classification. It is concluded that only through continuing empirical research efforts can we discover how nature "intended" personality pathology to be classified.
