Hung Chuan-Sheng, Lin Chun-Hung Richard, Liu Jain-Shing, Chen Shi-Huang, Hung Tsung-Chi, Tsai Chih-Min
Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan.
Artificial Intelligence Research and Promotion Center, National Sun Yat-sen University, Kaohsiung, Taiwan.
PLoS One. 2024 Dec 31;19(12):e0314995. doi: 10.1371/journal.pone.0314995. eCollection 2024.
Kawasaki Disease (KD) is a rare febrile illness affecting infants and young children that can lead to coronary artery complications and, in severe cases, death if left untreated. However, KD is frequently misdiagnosed as a common fever in clinical settings, and the inherent class imbalance of KD data further complicates accurate prediction with traditional machine learning and statistical methods. This paper introduces two approaches to address these challenges and improve prediction accuracy and generalizability. The first is a stacking model termed the Disease Classifier (DC), specifically designed to recognize minority class samples within imbalanced datasets, thereby mitigating the bias toward the majority class commonly observed in traditional models. The second is a combined model, the Disease Classifier with CTGAN (CTGAN-DC), which integrates DC with a Conditional Tabular Generative Adversarial Network (CTGAN) to further improve data balance and predictive performance. Using CTGAN-based oversampling, this model retains the original data characteristics of KD while expanding data diversity. The oversampling balances positive and negative KD samples, significantly reducing model bias toward the majority class and enhancing both predictive accuracy and generalizability. Experimental evaluations indicate substantial performance gains, with the DC and CTGAN-DC models achieving notably higher predictive accuracy than individual machine learning models. Specifically, the DC model achieves sensitivity and specificity of 95%, while the CTGAN-DC model achieves 95% sensitivity and 97% specificity, demonstrating superior recognition capability.
Furthermore, both models exhibit strong generalizability across diverse KD datasets. The CTGAN-DC model in particular surpasses the JAMA model, with a 3% increase in sensitivity and a 95% improvement in generalization sensitivity and specificity, effectively resolving the model collapse issue observed in the JAMA model. In summary, the proposed DC and CTGAN-DC architectures demonstrate robust generalizability across multiple KD datasets from various healthcare institutions and significantly outperform other models, including XGBoost. These findings lay a solid foundation for advancing disease prediction in the context of imbalanced medical data.
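As a minimal illustration of why the abstract reports sensitivity and specificity rather than accuracy alone, the following Python sketch compares two hypothetical classifiers on an imbalanced cohort. The cohort size, the predictions, and all function and variable names are assumptions made for illustration; none of this is the authors' data or code, and it does not reproduce the DC or CTGAN-DC models.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # KD cases predicted as KD
    fn: int  # KD cases predicted as non-KD (missed)
    tn: int  # non-KD cases predicted as non-KD
    fp: int  # non-KD cases predicted as KD


def confusion(y_true, y_pred):
    """Count outcomes for binary labels where 1 = KD and 0 = non-KD."""
    pairs = list(zip(y_true, y_pred))
    return Confusion(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
    )


def sensitivity(c):
    """Fraction of true KD cases that the model detects."""
    return c.tp / (c.tp + c.fn)


def specificity(c):
    """Fraction of non-KD cases that the model correctly rules out."""
    return c.tn / (c.tn + c.fp)


def accuracy(c):
    """Fraction of all cases classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


# Hypothetical cohort: 10 KD cases among 1000 febrile children.
y_true = [1] * 10 + [0] * 990

# Classifier A: always predicts the majority class (non-KD).
majority_only = confusion(y_true, [0] * 1000)

# Classifier B: detects 9 of 10 KD cases at the cost of 30 false alarms.
screening = confusion(y_true, [1] * 9 + [0] * 1 + [1] * 30 + [0] * 960)

for name, c in [("majority-only", majority_only), ("screening", screening)]:
    print(
        f"{name}: accuracy={accuracy(c):.3f} "
        f"sensitivity={sensitivity(c):.3f} specificity={specificity(c):.3f}"
    )
```

In this toy setting the majority-only classifier has the higher accuracy while detecting no KD cases at all, which is the kind of majority-class bias the abstract describes. The helper functions assume that each class has at least one sample; otherwise the divisions would fail.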