Cassidy Bill, Kendrick Connah, Brodzicki Andrzej, Jaworek-Korjakowska Joanna, Yap Moi Hoon
Manchester Metropolitan University, John Dalton Building, Chester Street, Manchester M1 5GD, UK.
AGH University of Science and Technology, Al Mickiewicza 30, 30-059 Krakow, Poland.
Med Image Anal. 2022 Jan;75:102305. doi: 10.1016/j.media.2021.102305. Epub 2021 Nov 16.
The International Skin Imaging Collaboration (ISIC) datasets have become a leading repository for researchers in machine learning for medical image analysis, especially in the field of skin cancer detection and malignancy assessment. They contain tens of thousands of dermoscopic photographs together with gold-standard lesion diagnosis metadata. The associated yearly challenges have resulted in major contributions to the field, with papers reporting performance measures well in excess of those of human experts. Skin cancers can be divided into two major groups: melanoma and non-melanoma. Although less prevalent, melanoma is considered more serious because it can quickly spread to other organs if not treated at an early stage. In this paper, we summarise the usage of the ISIC dataset images and present an analysis of the yearly releases from 2016 to 2020. Our analysis found a significant number of duplicate images, both within and between the datasets. We also found duplicates spread across the training and testing sets. Because of these irregularities, we propose a duplicate removal strategy and recommend a curated dataset for researchers to use when working with ISIC datasets. Given that ISIC 2020 focused on melanoma classification, we conduct experiments to provide benchmark results on the ISIC 2020 test set, with additional analysis on the smaller ISIC 2017 test set. Testing was completed after applying our duplicate removal strategy and an additional data balancing step. After removing 14,310 duplicate images from the training set, our benchmark results show good levels of melanoma prediction, with an AUC of 0.80 for the best performing model. As our aim was not to maximise network performance, we did not include additional steps in our experiments. Finally, we provide recommendations for future research by highlighting irregularities that may present research challenges.
A list of image files with reference to the original ISIC dataset sources for the recommended curated training set will be shared on our GitHub repository (available at www.github.com/mmu-dermatology-research/isic_duplicate_removal_strategy).
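The abstract does not say how duplicates were identified, so the following is only a minimal sketch of one possible approach, not the authors' method. It assumes duplicates are byte-identical files and detects them by comparing SHA-256 digests; images that were resized or re-encoded would need a perceptual comparison instead. The function names (`file_digest`, `group_duplicates`, `remove_train_duplicates`) and the policy of dropping any training image that also appears in the test set are hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def group_duplicates(paths):
    """Group files by digest and keep only groups with two or more members.

    Assumption: "duplicate" means byte-identical. This will miss images
    that were resized or re-compressed between yearly releases.
    """
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(p)].append(str(p))
    return {d: sorted(g) for d, g in groups.items() if len(g) > 1}


def remove_train_duplicates(train_paths, test_paths):
    """Return the training files to keep.

    Hypothetical policy: keep the first copy (in sorted path order) of each
    unique training image, and drop any training image that also occurs in
    the test set, so that no image leaks from training into testing.
    """
    test_digests = {file_digest(p) for p in test_paths}
    seen = set()
    kept = []
    for p in sorted(str(x) for x in train_paths):
        d = file_digest(p)
        if d in test_digests or d in seen:
            continue
        seen.add(d)
        kept.append(p)
    return kept
```

Passing the combined file lists of several yearly releases to `group_duplicates` would surface both within-dataset and between-dataset duplicates under the same byte-identity assumption, while `remove_train_duplicates` addresses the overlap between training and testing sets.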