Semi-automatic ground truth generation using unsupervised clustering and limited manual labeling: Application to handwritten character recognition.

Authors

Vajda Szilárd, Rangoni Yves, Cecotti Hubert

Affiliations

National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.

Henri Tudor Public Research Center, Kirchberg, L-1855, Luxembourg.

Publication information

Pattern Recognit Lett. 2015 Jun 1;58:23-28. doi: 10.1016/j.patrec.2015.02.001.

DOI:10.1016/j.patrec.2015.02.001

PMID:25870463

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4392711/

Abstract

Training supervised classifiers to recognize different patterns requires large data collections with accurate labels. In this paper, we propose a generic, semi-automatic labeling technique for large handwritten character collections. To speed up the creation of large-scale ground truth, the method combines unsupervised clustering with minimal expert knowledge. To exploit the potential discriminant complementarities across features, each character is projected into five different feature spaces. After the images are clustered in each feature space, a human expert labels the cluster centers, and each data point inherits the label of its cluster's center. A majority (or unanimity) vote across the feature spaces then decides the label of each character image. The amount of human involvement (labeling) is strictly controlled by the number of clusters produced by the chosen clustering approach. To test the efficiency of the proposed approach, we compared and evaluated three state-of-the-art clustering methods (k-means, self-organizing maps, and growing neural gas) on the MNIST digit data set and on a Lampung Indonesian character data set. Using a k-NN classifier, we show that manually labeling only 1.3% (MNIST) and 3.2% (Lampung) of the training data yields performance in the same range as a completely labeled data set.
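The labeling pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions: toy 2-D points stand in for character images, three hypothetical feature maps stand in for the paper's five feature spaces, plain k-means with a farthest-first initialisation is used as the clustering method, and the "expert" is simulated by looking up the true label of the sample closest to each cluster center. All function names are illustrative.

```python
import random
from collections import Counter

def sqdist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=10):
    """Lloyd's k-means with deterministic farthest-first initialisation (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    assign = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
    return centers, assign

def semi_automatic_labels(samples, feature_maps, k, expert, unanimous=False):
    """Cluster in each feature space, label only the centers, then vote per sample."""
    votes = [[] for _ in samples]
    queries = 0  # number of labels requested from the expert
    for fmap in feature_maps:
        feats = [fmap(s) for s in samples]
        centers, assign = kmeans(feats, k)
        center_labels = []
        for c in centers:
            # Simulated expert: inspects the sample nearest to the center.
            nearest = min(range(len(feats)), key=lambda i: sqdist(feats[i], c))
            center_labels.append(expert(nearest))
            queries += 1
        for i, a in enumerate(assign):
            votes[i].append(center_labels[a])  # sample inherits its center's label
    labels = []
    for v in votes:
        label, count = Counter(v).most_common(1)[0]
        needed = len(v) if unanimous else len(v) // 2 + 1
        labels.append(label if count >= needed else None)  # None = no decision
    return labels, queries

# Toy data: two well-separated classes with bounded noise.
rng = random.Random(0)
truth = [0] * 100 + [1] * 100
samples = [(10 * t + rng.uniform(-1, 1), 10 * t + rng.uniform(-1, 1)) for t in truth]

# Hypothetical feature spaces: the full point, and each coordinate alone.
feature_maps = [lambda s: s, lambda s: (s[0],), lambda s: (s[1],)]

labels, queries = semi_automatic_labels(
    samples, feature_maps, k=4, expert=lambda i: truth[i]
)
accuracy = sum(l == t for l, t in zip(labels, truth)) / len(truth)
print(queries, accuracy)
```

In this sketch the expert is asked for one label per cluster per feature space, so the manual effort is the number of feature spaces times `k` regardless of how many samples there are. Passing `unanimous=True` switches from majority to unanimity voting, leaving samples on which the feature spaces disagree unlabeled (`None`).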
