Semi-automatic ground truth generation using unsupervised clustering and limited manual labeling: Application to handwritten character recognition.

Authors

Vajda Szilárd, Rangoni Yves, Cecotti Hubert

Affiliations

National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.

Henri Tudor Public Research Center, Kirchberg, L-1855, Luxembourg.

Publication information

Pattern Recognit Lett. 2015 Jun 1;58:23-28. doi: 10.1016/j.patrec.2015.02.001.

Abstract

Training supervised classifiers to recognize different patterns requires large data collections with accurate labels. In this paper, we propose a generic, semi-automatic labeling technique for large handwritten character collections. To speed up the creation of large-scale ground truth, the method combines unsupervised clustering with minimal expert knowledge. To exploit the potential discriminant complementarities across features, each character is projected into five different feature spaces. After the images are clustered in each feature space, a human expert labels the cluster centers, and each data point inherits the label of its cluster's center. A majority (or unanimity) vote then decides the label of each character image. The amount of human involvement (labeling) is strictly controlled by the number of clusters produced by the chosen clustering approach. To test the efficiency of the proposed approach, we compared and evaluated three state-of-the-art clustering methods (k-means, self-organizing maps, and growing neural gas) on the MNIST digit data set and on a Lampung Indonesian character data set. Using a k-NN classifier, we show that manually labeling only 1.3% (MNIST) and 3.2% (Lampung) of the training data yields performance in the same range as a completely labeled data set.
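The labeling pipeline described in the abstract (cluster in each feature space, have an expert label the cluster centers, propagate each center's label to its members, then vote) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses a plain NumPy k-means only, the function names (`kmeans`, `label_one_space`, `vote`, `semi_automatic_labels`) are hypothetical, the human expert is simulated by an `oracle` callback that labels the sample nearest to each cluster center, and samples that fail the vote are marked with `-1`. The five feature spaces used in the paper are not named in the abstract, so the caller supplies whatever feature matrices are available.

```python
import numpy as np
from collections import Counter


def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means. Returns (centers, cluster index per sample)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dist.argmin(axis=1)


def label_one_space(X, k, oracle, seed=0):
    """Cluster one feature space and propagate expert labels to all samples.

    `oracle(i)` stands in for the human expert (an assumption of this
    sketch): it returns the label of sample i, and is called only for the
    sample closest to each cluster center, i.e. k times.
    """
    centers, assign = kmeans(X, k, seed=seed)
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = dist.argmin(axis=0)  # sample index closest to each center
    center_labels = np.array([oracle(int(i)) for i in nearest])
    return center_labels[assign]   # each sample inherits its center's label


def vote(label_matrix, unanimous=False):
    """Combine labels of shape (n_spaces, n_samples); -1 marks rejected samples."""
    n_spaces, n_samples = label_matrix.shape
    needed = n_spaces if unanimous else n_spaces // 2 + 1
    out = np.full(n_samples, -1)
    for i, column in enumerate(label_matrix.T):
        label, count = Counter(column.tolist()).most_common(1)[0]
        if count >= needed:
            out[i] = label
    return out


def semi_automatic_labels(feature_spaces, k, oracle, unanimous=False):
    """Run clustering and label propagation per feature space, then vote."""
    per_space = np.stack([
        label_one_space(X, k, oracle, seed=s)
        for s, X in enumerate(feature_spaces)
    ])
    return vote(per_space, unanimous)
```

In this sketch the oracle is queried `k` times per feature space, which mirrors the abstract's point that the manual effort is bounded by the number of clusters. Passing `unanimous=True` trades coverage for precision: more samples are left unlabeled (`-1`), but those that receive a label agree across every feature space.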
