Pattern Recognit Lett. 2015 Jun 1;58:23-28.

Semi-automatic ground truth generation using unsupervised clustering and limited manual labeling: Application to handwritten character recognition.

Author information

1. National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
2. Henri Tudor Public Research Center, Kirchberg, L-1855, Luxembourg.
3. Faculty of Computing and Engineering, University of Ulster, Londonderry, BT48 7JL, Northern Ireland, UK.

Abstract

Training supervised classifiers to recognize different patterns requires large data collections with accurate labels. In this paper, we propose a generic, semi-automatic labeling technique for large handwritten character collections. To speed up the creation of large-scale ground truth, the method combines unsupervised clustering with minimal expert knowledge. To exploit the potential discriminant complementarities across features, each character is projected into five different feature spaces. After the images are clustered in each feature space, a human expert labels the cluster centers, and each data point inherits the label of its cluster's center. A majority (or unanimity) vote across the feature spaces then decides the label of each character image. The amount of human involvement (labeling) is strictly controlled by the number of clusters produced by the chosen clustering approach. To test the efficiency of the proposed approach, we compared and evaluated three state-of-the-art clustering methods (k-means, self-organizing maps, and growing neural gas) on the MNIST digit data set and on a Lampung Indonesian character data set. Using a k-NN classifier, we show that manually labeling only 1.3% (MNIST) and 3.2% (Lampung) of the training data yields performance in the same range as that obtained with a completely labeled data set.
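
The labeling procedure described in the abstract (cluster in each feature space, have the expert label one item per cluster, propagate that label to the cluster's members, then vote across feature spaces) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It makes several assumptions: it uses a plain k-means written with the Python standard library, three toy feature spaces on synthetic 2-D points instead of the paper's five feature spaces on character images, and a simulated expert (`ask_expert`) who labels the cluster member nearest to each center. All function names and parameters are hypothetical.

```python
import random
from collections import Counter

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, cluster index per point)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers, assign

def label_by_clusters(features, k, ask_expert, seed=0):
    """Cluster one feature space; the expert labels one item per cluster."""
    centers, assign = kmeans(features, k, seed=seed)
    cluster_label = {}
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            # Assumption: the expert sees the member closest to the center.
            rep = min(members, key=lambda i: sq_dist(features[i], centers[c]))
            cluster_label[c] = ask_expert(rep)
    # Every point inherits the label given to its cluster.
    return [cluster_label[a] for a in assign]

def vote(labels_per_space, unanimous=False):
    """Combine per-space labels; None marks items without enough agreement."""
    result = []
    for votes in zip(*labels_per_space):
        label, count = Counter(votes).most_common(1)[0]
        needed = len(votes) if unanimous else len(votes) // 2 + 1
        result.append(label if count >= needed else None)
    return result

# Toy data: two well-separated classes standing in for character images.
rng = random.Random(1)
truth = [0] * 50 + [1] * 50
points = [(10 * t + rng.uniform(-1, 1), 10 * t + rng.uniform(-1, 1))
          for t in truth]

# Three hypothetical feature spaces derived from the same items.
spaces = [points,
          [(x,) for x, y in points],
          [(x + y,) for x, y in points]]

queries = []  # indices of the items the simulated expert had to label

def ask_expert(i):
    queries.append(i)
    return truth[i]

labels_per_space = [label_by_clusters(f, k=4, ask_expert=ask_expert)
                    for f in spaces]
final = vote(labels_per_space)

print("manually labeled fraction:", len(set(queries)) / len(points))
print("agreement with ground truth:",
      sum(f == t for f, t in zip(final, truth)) / len(truth))
```

In this sketch the manual effort is bounded by the number of clusters times the number of feature spaces, which mirrors the abstract's point that the cluster count controls how much labeling the expert does. Passing `unanimous=True` to `vote` leaves items unlabeled (`None`) unless every feature space agrees.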
