Shivasree Yerrabati, RaviSankar V
Department of Computer Science & Engineering, GITAM Deemed to be University, Hyderabad, Telangana, India.
MethodsX. 2025 Aug 24;15:103584. doi: 10.1016/j.mex.2025.103584. eCollection 2025 Dec.
This study proposes an end-to-end multimodal learning framework for early skin disease detection, incorporating spatial, temporal, and semantic information across heterogeneous patient data. The framework is composed of three key modules: (i) EfficientNet-B4, which extracts rich visual features from dermoscopic images; (ii) a BiLSTM enhanced with temporal attention to model symptom evolution from sensor-based time-series signals; and (iii) ClinicalBERT, a domain-specific transformer that generates contextual embeddings from patient clinical narratives. Modality-specific features are combined with a multi-head cross-attention mechanism to aggregate inter-dependencies among input patterns and are then fed into a Graph Attention Network (GAT) to capture inter-patient relationships according to feature affinity. This joint framework produces context-aware representations that can be used for classification. Experimental results show that the model achieves an average predictive accuracy of 89.6 % and an F1-score of 0.886, outperforming state-of-the-art CNN-based baselines. By simultaneously optimizing spatial detail, temporal dynamics, and clinical context, the proposed SkinHarmoNet model provides reliable and interpretable predictions, and its performance establishes a new state of the art for multimodal dermatologic AI in a clinical setting.
•Multimodal fusion: spatial, temporal, and semantic modalities
•Cross-attention and GAT: enhanced feature interaction
•High performance: 89.6 % accuracy, F1 = 0.886.
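The abstract describes two fusion steps: cross-attention that lets one modality's features attend over another's, and a GAT layer over a patient-affinity graph. The paper's implementation is not given here, so the following is only a minimal NumPy sketch of those two operations under stated assumptions: single-head attention rather than multi-head, random stand-in features in place of EfficientNet-B4 and ClinicalBERT outputs, a fully connected toy patient graph, and illustrative dimensions throughout.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, Wq, Wk, Wv):
    """Single-head scaled dot-product cross-attention:
    rows of `query` attend over rows of `context`."""
    Q, K, V = query @ Wq, context @ Wk, context @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def gat_layer(H, A, W, a):
    """One graph-attention layer over adjacency A:
    e_ij = LeakyReLU(a^T [Wh_i || Wh_j]), normalized over neighbors."""
    Z = H @ W
    N = Z.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            s = np.concatenate([Z[i], Z[j]]) @ a
            e[i, j] = s if s > 0 else 0.2 * s  # LeakyReLU
    e = np.where(A > 0, e, -1e9)               # mask non-edges
    alpha = softmax(e, axis=-1)                # attention coefficients
    return alpha @ Z

# Toy setup: 4 patients, feature dimension 8 for both modalities.
N, d = 4, 8
img = rng.normal(size=(N, d))   # stand-in for EfficientNet-B4 image features
txt = rng.normal(size=(N, d))   # stand-in for ClinicalBERT text embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Image features attend over clinical-text features.
fused = cross_attention(img, txt, Wq, Wk, Wv)

# GAT over a fully connected toy patient graph.
A = np.ones((N, N))
W, a = rng.normal(size=(d, d)), rng.normal(size=(2 * d,))
out = gat_layer(fused, A, W, a)
print(out.shape)
```

The resulting per-patient representations (`out`) correspond to the "context-aware representations" the abstract says are passed to the classifier; a real system would learn all weight matrices end-to-end and build the patient graph from feature affinity rather than using a complete graph.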