Vajda Szilárd, Rangoni Yves, Cecotti Hubert
National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
Henri Tudor Public Research Center, Kirchberg, L-1855, Luxembourg.
Pattern Recognit Lett. 2015 Jun 1;58:23-28. doi: 10.1016/j.patrec.2015.02.001.
Training supervised classifiers to recognize patterns requires large data collections with accurate labels. In this paper, we propose a generic, semi-automatic labeling technique for large handwritten character collections. To speed up the creation of large-scale ground truth, the method combines unsupervised clustering with minimal expert knowledge. To exploit the potential discriminant complementarities across features, each character is projected into five different feature spaces. After the images are clustered in each feature space, a human expert labels the cluster centers, and each data point inherits the label of its cluster's center. A majority (or unanimity) vote across the feature spaces then decides the label of each character image. The amount of human labeling is strictly controlled by the number of clusters produced by the chosen clustering approach. To test the efficiency of the proposed approach, we compared and evaluated three state-of-the-art clustering methods (k-means, self-organizing maps, and growing neural gas) on the MNIST digit data set and a Lampung (Indonesian) character data set. Using a k-NN classifier, we show that manually labeling only 1.3% (MNIST) and 3.2% (Lampung) of the training data yields performance in the same range as a completely labeled data set.
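The labeling scheme described in the abstract (cluster in several feature spaces, have an expert label only the cluster centers, propagate those labels, then vote) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only a plain k-means, three toy feature spaces instead of the paper's five, and a simulated expert. The names `kmeans`, `label_by_clusters`, and `oracle` are hypothetical, and approximating "labeling a cluster center" by asking the expert about the data point nearest to that center is an assumption made here.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm); returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, [nearest(p, centers) for p in points]


def label_by_clusters(feature_spaces, oracle, k, vote="majority"):
    """Label every sample from k expert queries per feature space.

    feature_spaces: list of point lists, one per feature space, same order.
    oracle: callable(sample_index) -> label; stands in for the human expert.
    Returns one label per sample, or None where the vote is not decisive.
    """
    n = len(feature_spaces[0])
    votes = [[] for _ in range(n)]
    for s, points in enumerate(feature_spaces):
        centers, assign = kmeans(points, k, seed=s)
        center_labels = []
        for center in centers:
            # Assumption: the expert sees the sample closest to the center.
            rep = min(range(n), key=lambda i: sq_dist(points[i], center))
            center_labels.append(oracle(rep))
        for i, a in enumerate(assign):
            votes[i].append(center_labels[a])  # inherit the center's label
    labels = []
    for v in votes:
        label, count = Counter(v).most_common(1)[0]
        needed = len(v) if vote == "unanimity" else len(v) // 2 + 1
        labels.append(label if count >= needed else None)
    return labels


# Toy data: two well-separated 2-D classes (purely illustrative).
rng = random.Random(1)
truth = [i % 2 for i in range(40)]
raw = [[10 * t + rng.gauss(0, 0.5), 10 * t + rng.gauss(0, 0.5)] for t in truth]
# Three hypothetical "feature spaces" derived from the same samples.
spaces = [raw, [[2 * x, y] for x, y in raw], [[x + y] for x, y in raw]]

queried = []  # records every sample index shown to the simulated expert


def oracle(i):
    queried.append(i)
    return truth[i]


labels = label_by_clusters(spaces, oracle, k=4)
```

In this sketch the number of expert queries is the number of clusters times the number of feature spaces, which reflects the abstract's point that the manual effort is set by the cluster count rather than by the size of the collection. Passing `vote="unanimity"` leaves a sample unlabeled (`None`) unless all feature spaces agree.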