"UDE DIATOMS in the Wild 2024": a new image dataset of freshwater diatoms for training deep learning models.

Author information

Université de Lorraine, CNRS, LIEC, F-57000 Metz, France.

Georgia Tech Europe, CNRS IRL 2958, F-57000 Metz, France.

Publication information

Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae087.

Abstract

BACKGROUND

Diatoms are microalgae with finely ornamented microscopic silica shells. Their taxonomic identification by light microscopy is routinely used as part of community ecological research as well as ecological status assessment of aquatic ecosystems, and a need for digitalization of these methods has long been recognized. Alongside their high taxonomic and morphological diversity, several other factors make diatoms highly challenging for deep learning-based identification using light microscopy images. These include (i) an unusually high intraclass variability combined with small between-class differences, (ii) a rather different visual appearance of specimens depending on their orientation on the microscope slide, and (iii) the limited availability of diatom experts for accurate taxonomic annotation.

FINDINGS

We present "UDE DIATOMS in the Wild 2024," the largest diatom image dataset thus far, intended to facilitate the application and benchmarking of innovative deep learning methods for diatom identification on realistic research data. The dataset contains 83,570 images of 611 diatom taxa, 101 of which are represented by at least 100 examples and 144 by at least 50 examples each. We showcase this dataset in 2 innovative analyses that address individual aspects of the above challenges, using subclustering to deal with visually heterogeneous classes, out-of-distribution sample detection, and semi-supervised learning.

CONCLUSIONS

The problem of image-based identification of diatoms is both important for environmental research and challenging from the machine learning perspective. By making available the largest image dataset so far, accompanied by innovative analyses, this contribution will help the scientific community address these challenges.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a4d/11604061/de3ba02333ff/giae087fig1.jpg
