Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets.

Authors

Abhishek Kumar, Jain Aditi, Hamarneh Ghassan

Affiliations

School of Computing Science, Simon Fraser University, Burnaby, V5A 1S6, Canada.

Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, 110016, India.

Publication information

Sci Data. 2025 Feb 1;12(1):196. doi: 10.1038/s41597-025-04382-5.

Abstract

The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of the data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets (DermaMNIST, its source HAM10000, and Fitzpatrick17k), uncover these data quality issues, measure their effects on the benchmark results, and propose corrections to the datasets. By making our analysis pipeline and the accompanying code publicly available, we ensure the reproducibility of our analysis and aim to encourage similar explorations and to facilitate the identification and correction of potential data quality issues in other large datasets.
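One of the issues named in the abstract, data leakage across train-test partitions, can be illustrated with a minimal sketch. It flags only byte-identical files by comparing SHA-256 digests. This is not the authors' pipeline, which the abstract does not describe; near-duplicates (resized or re-encoded images) would need a different technique such as perceptual hashing or embedding similarity. The function name, the name-to-bytes layout, and the toy data are all hypothetical.

```python
import hashlib
from collections import defaultdict


def find_cross_partition_duplicates(train, test):
    """Return sorted (train_name, test_name) pairs with byte-identical contents.

    `train` and `test` map an image name to its raw bytes (hypothetical layout).
    """
    # Index the training partition by content digest.
    digest_to_train = defaultdict(list)
    for name, data in train.items():
        digest_to_train[hashlib.sha256(data).hexdigest()].append(name)

    # A test image whose digest is already indexed also appears in training.
    leaks = []
    for name, data in test.items():
        digest = hashlib.sha256(data).hexdigest()
        for train_name in digest_to_train.get(digest, []):
            leaks.append((train_name, name))
    return sorted(leaks)


# Toy stand-ins for image files; real use would read the bytes from disk.
train_images = {"train_001.jpg": b"lesion-A", "train_002.jpg": b"lesion-B"}
test_images = {"test_001.jpg": b"lesion-B", "test_002.jpg": b"lesion-C"}

leaks = find_cross_partition_duplicates(train_images, test_images)
print(leaks)  # [('train_002.jpg', 'test_001.jpg')]
```

Exact hashing is cheap and has no false positives, which makes it a reasonable first pass, but an empty result does not rule out leakage through visually similar images.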

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f20d/11787307/0dd204923ab0/41597_2025_4382_Fig1_HTML.jpg
