Reducing annotation effort in agricultural data: simple and fast unsupervised coreset selection with DINOv2 and K-means.

Author information

Gómez-Zamanillo Laura, Portilla Nagore, Picón Artzai, Egusquiza Itziar, Navarra-Mestre Ramón, Elola Andoni, Bereciartua-Perez Arantza

Affiliations

TECNALIA, Basque Research and Technology Alliance, Parque Tecnológico de Bizkaia, Derio, Bizkaia, Spain.

University of the Basque Country, Bilbao, Bizkaia, Spain.

Publication information

Front Plant Sci. 2025 May 14;16:1546756. doi: 10.3389/fpls.2025.1546756. eCollection 2025.

DOI:10.3389/fpls.2025.1546756

PMID:40438735

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12116677/

Abstract

The need for large amounts of annotated data is a major obstacle to adopting deep learning in agricultural applications, where annotation is typically time-consuming and requires expert knowledge. To address this issue, methods have been developed that select, for manual annotation, data representing the existing variability in the dataset, thereby avoiding redundant information. Coreset selection methods aim to choose a small subset of data samples that best represents the entire dataset. These methods can therefore be used to select a reduced set of samples for annotation, so that a deep learning model trained on them reaches the best possible performance. In this work, we propose a simple yet effective coreset selection method that combines the recent foundation model DINOv2, used as a powerful feature extractor, with the well-known K-Means clustering method. Samples are selected from each calculated cluster to form the final coreset. The proposed method is validated by comparing the performance metrics of a multiclass classification model trained on datasets reduced by random sampling against those of the same model trained on datasets reduced with the proposed method. This validation is conducted on two different datasets; in both cases the proposed method achieves better results, with F1-score improvements of up to 0.15 when the training dataset is significantly reduced. Additionally, the importance of using DINOv2 as the feature extractor in achieving these results is studied.
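The selection step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not say how samples are picked within a cluster or how K is chosen, so this sketch assumes K equals the annotation budget and keeps the one sample closest to each centroid. The function names (`farthest_point_init`, `kmeans`, `select_coreset`) are hypothetical, the deterministic farthest-point initialization is chosen only to make the example reproducible, and the toy 2-D points stand in for DINOv2 image embeddings, which in practice would be high-dimensional vectors produced by the model.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(feature, centroids):
    """Index of the centroid closest to `feature`."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(feature, centroids[c]))


def farthest_point_init(features, k):
    """Deterministic init (an assumption of this sketch): start from the
    first sample, then repeatedly add the sample farthest from all
    centroids chosen so far."""
    centroids = [list(features[0])]
    while len(centroids) < k:
        far = max(features,
                  key=lambda f: min(squared_distance(f, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(features, k, iterations=20):
    """Plain Lloyd's algorithm; returns the final centroids."""
    centroids = farthest_point_init(features, k)
    for _ in range(iterations):
        # Assignment step: group samples by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest(f, centroids)].append(f)
        # Update step: move each centroid to the mean of its members,
        # keeping the old centroid if a cluster ends up empty.
        centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids


def select_coreset(features, k):
    """Return sorted indices of one representative per non-empty cluster:
    the sample closest to that cluster's centroid (assumed rule)."""
    centroids = kmeans(features, k)
    best = {}  # cluster id -> (distance to centroid, sample index)
    for i, f in enumerate(features):
        c = nearest(f, centroids)
        d = squared_distance(f, centroids[c])
        if c not in best or d < best[c][0]:
            best[c] = (d, i)
    return sorted(i for _, i in best.values())


# Toy stand-in for image embeddings: three tight groups of five 2-D points.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5), (0.0, 0.0)]
features = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

coreset = select_coreset(features, k=3)
print(coreset)  # indices of the samples that would be sent for annotation
```

On this toy data the sketch keeps one point from each of the three groups, so 15 samples are reduced to 3 without losing any group. A random subset of the same size has no such guarantee, which is the intuition behind comparing the two reduction strategies.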
