Gómez-Martínez Vanesa, Chushig-Muzo David, Veierød Marit B, Granja Conceição, Soguero-Ruiz Cristina
Department of Signal Theory and Communications, Telematics and Computing Systems, Rey Juan Carlos University, Madrid, 28943, Spain.
Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway.
BioData Min. 2024 Oct 30;17(1):46. doi: 10.1186/s13040-024-00397-7.
Cutaneous melanoma is the most aggressive form of skin cancer and is responsible for most skin cancer-related deaths. Recent advances in artificial intelligence, together with the availability of public dermoscopy image datasets, have made it possible to assist dermatologists in melanoma identification. While image feature extraction holds potential for melanoma detection, it often leads to high-dimensional data. Furthermore, most image datasets suffer from class imbalance, where a few classes have numerous samples whereas others are under-represented.
In this paper, we propose combining ensemble feature selection (FS) methods and data augmentation with conditional tabular generative adversarial networks (CTGAN) to enhance melanoma identification in imbalanced datasets. We used dermoscopy images from two public datasets, PH2 and Derm7pt, which contain melanoma and not-melanoma lesions. To capture intrinsic information from skin lesions, we applied two feature extraction (FE) approaches: handcrafted features and embedding features. For the former, color, geometric, and first-, second-, and higher-order texture features were extracted; for the latter, embeddings were obtained using ResNet-based models. To alleviate the high dimensionality resulting from FE, ensemble FS with filter methods was used and evaluated. For data augmentation, we conducted a progressive analysis of the imbalance ratio (IR), which is related to the number of synthetic samples created, and evaluated its impact on the predictive results. To gain interpretability of the predictive models, we used SHAP, statistical tests based on bootstrap resampling, and UMAP visualizations.
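As a rough illustration of ensemble FS with filter methods, the sketch below scores each feature with two simple filters and averages the resulting ranks. The abstract does not state which filters or which aggregation rule were used, so the two filters here (absolute Pearson correlation with the label and a two-class Fisher score), the mean-rank aggregation, and the function names are all assumptions made for illustration only.

```python
from statistics import mean, pvariance


def abs_pearson(x, y):
    """Filter 1 (assumed): absolute Pearson correlation between a feature and the label."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def fisher_score(x, y):
    """Filter 2 (assumed): squared difference of class means over summed class variances."""
    pos = [a for a, label in zip(x, y) if label == 1]
    neg = [a for a, label in zip(x, y) if label == 0]
    var_sum = pvariance(pos) + pvariance(neg)
    return (mean(pos) - mean(neg)) ** 2 / var_sum if var_sum else 0.0


def ranks(scores):
    """Rank features so that the highest score receives rank 1."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    result = [0] * len(scores)
    for rank, j in enumerate(order, start=1):
        result[j] = rank
    return result


def ensemble_select(X, y, k):
    """Return the indices of the k features with the best mean rank across filters."""
    columns = list(zip(*X))  # one tuple of values per feature
    per_filter = [
        ranks([score(col, y) for col in columns])
        for score in (abs_pearson, fisher_score)
    ]
    mean_rank = [mean(r) for r in zip(*per_filter)]
    return sorted(range(len(columns)), key=lambda j: mean_rank[j])[:k]


# Toy data (hypothetical): 6 lesions, 3 features, label 1 = melanoma.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 3.0],
    [0.9, 4.0, 2.5],
    [3.0, 4.5, 3.5],
    [3.1, 3.5, 2.8],
    [2.9, 4.0, 4.0],
]
y = [0, 0, 0, 1, 1, 1]
selected = ensemble_select(X, y, k=2)
print(selected)
```

Aggregating ranks rather than raw scores keeps filters with different scales comparable, which is one common reason to prefer an ensemble over a single filter.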
The combination of ensemble FS, CTGAN, and linear models yielded the best predictive results, with AUCROC values of 87% (support vector machine, IR=0.9) for PH2 and 76% (LASSO, IR=1.0) for Derm7pt. We also found that melanoma lesions were mainly characterized by color-related features, while not-melanoma lesions were characterized by texture features.
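To make the link between the IR and the number of synthetic samples concrete, the following sketch assumes that IR is defined as the minority-class count divided by the majority-class count, so that IR=1.0 corresponds to a balanced dataset. This definition is an assumption that is consistent with the IR values reported here but is not stated in the abstract. The class counts and the function name are likewise hypothetical.

```python
import math


def synthetic_samples_needed(n_minority, n_majority, target_ir):
    """Number of synthetic minority samples needed to reach target_ir.

    Assumes IR = minority count / majority count (an assumption, see lead-in).
    """
    if not 0 < target_ir <= 1:
        raise ValueError("target_ir must be in (0, 1]")
    # Round before ceil to guard against floating-point noise such as 111.99999999.
    target_minority = math.ceil(round(target_ir * n_majority, 9))
    return max(0, target_minority - n_minority)


# Illustrative counts only: 40 melanoma vs. 160 not-melanoma lesions.
for ir in (0.5, 0.7, 0.9, 1.0):
    print(ir, synthetic_samples_needed(40, 160, ir))
```

A progressive analysis in this spirit would generate the indicated number of samples at each IR (for instance with CTGAN), retrain the classifier, and compare the resulting AUCROC values.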
Our results demonstrate the effectiveness of ensemble FS and synthetic data in developing models that accurately identify melanoma. This research advances skin lesion analysis, contributing both to melanoma detection and to the interpretation of the main features that drive its identification.