

Design of an iterative hybrid multimodal deep learning method for early skin disease detection with cross-attention and graph-based fusions.

Author information

Shivasree Yerrabati, RaviSankar V

Affiliations

Department of Computer Science & Engineering, GITAM Deemed to be University, Hyderabad, Telangana, India.

Publication information

MethodsX. 2025 Aug 24;15:103584. doi: 10.1016/j.mex.2025.103584. eCollection 2025 Dec.

Abstract

This study proposes an end-to-end multimodal learning framework for early skin disease detection, incorporating spatial, temporal, and semantic information across heterogeneous patient data. The framework is composed of three key modules: (i) EfficientNet-B4, which extracts rich visual features from dermoscopic images; (ii) a BiLSTM enhanced with temporal attention to model symptom evolution from sensor-based time-series signals; and (iii) ClinicalBERT, a domain-specific transformer that generates contextual embeddings from patient clinical narratives. Modality-specific features are combined with a multi-head cross-attention mechanism to aggregate the inter-dependencies of input patterns and then fed into a Graph Attention Network (GAT) to capture inter-patient relationships according to feature affinity. This joint framework produces context-aware representations that can be used for classification. Experimental results show that the model achieves a predictive accuracy of 89.6 % and an average F1-score of 0.886, outperforming state-of-the-art CNN-based baselines. By simultaneously optimizing spatial detail, temporal dynamics, and clinical context, the proposed SkinHarmoNet model provides reliable and interpretable predictions, and its performance establishes a new state of the art for multimodal dermatologic AI in a clinical setting.

• Multimodal fusion: spatial, temporal, and semantic modalities
• Cross-attention and GAT: enhanced interaction of features
• High performance: 89.6 % accuracy, F1 = 0.886
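The abstract does not include code, so the following is only a minimal NumPy sketch of the multi-head cross-attention fusion step it describes: one modality's features (the image embedding) act as queries attending over the concatenated features of the other modalities (time-series and text). The random projection matrices, feature dimension, and toy embeddings are all illustrative assumptions, standing in for the learned parameters of the actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, num_heads=4, seed=0):
    """Multi-head cross-attention: rows of `query` attend to rows of `context`.
    Shapes: query (n_q, d), context (n_c, d); d must divide by num_heads.
    Random projections stand in for learned weight matrices."""
    n_q, d = query.shape
    d_h = d // num_heads
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q = (query @ Wq).reshape(n_q, num_heads, d_h)
    K = (context @ Wk).reshape(-1, num_heads, d_h)
    V = (context @ Wv).reshape(-1, num_heads, d_h)
    out = np.empty_like(Q)
    for h in range(num_heads):
        # Scaled dot-product attention per head: (n_q, n_c) score matrix.
        scores = Q[:, h] @ K[:, h].T / np.sqrt(d_h)
        out[:, h] = softmax(scores) @ V[:, h]
    return out.reshape(n_q, d)  # heads re-concatenated into one vector per query

# Toy modality embeddings (hypothetical, d = 8): one image feature vector,
# five time-series step features, three clinical-text token embeddings.
img = np.ones((1, 8))
ts = np.full((5, 8), 0.5)
txt = np.full((3, 8), -0.2)
fused = cross_attention(img, np.vstack([ts, txt]))  # image queries the others
print(fused.shape)
```

In the paper's pipeline this fused representation would then become a node feature for the GAT, which weighs edges between patients by feature affinity; that graph stage is omitted here.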


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/12c6/12423411/0aa3bfb749d7/ga1.jpg
