Zhong Yuan, Cui Suhan, Wang Jiaqi, Wang Xiaochen, Yin Ziyi, Wang Yaqing, Xiao Houping, Huai Mengdi, Wang Ting, Ma Fenglong
The Pennsylvania State University.
Purdue University.
Proc SIAM Int Conf Data Min. 2024;2024:499-507. doi: 10.1137/1.9781611978032.58.
Health risk prediction aims to forecast the potential health risks that patients may face using their historical Electronic Health Records (EHR). Although several effective models have been developed, data insufficiency remains a key issue undermining their effectiveness. Various data generation and augmentation methods have been introduced to mitigate this issue by learning the underlying data distribution and expanding the training set. However, the performance of these methods is often limited because they are designed independently of the downstream task. To address these shortcomings, this paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion. It enhances risk prediction performance by creating synthetic patient data during training to enlarge the sample space. Furthermore, MedDiffusion discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data. Experimental evaluation on four real-world medical datasets demonstrates that MedDiffusion outperforms 14 cutting-edge baselines in terms of PR-AUC, F1, and Cohen's Kappa. We also conduct ablation studies and benchmark our model against GAN-based alternatives to further validate the rationality and adaptability of our model design. Additionally, we analyze the generated data to offer fresh insights into the model's interpretability. The source code is available via https://shorturl.at/aerT0.
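For readers unfamiliar with the three reported metrics, the following is a minimal, dependency-free sketch of how they can be computed for a binary risk label. It is not taken from the MedDiffusion code base: the function names and the toy labels are hypothetical, and PR-AUC is approximated here by average precision, which is one common estimator and may differ from the one the authors used.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the marginal frequencies of each class.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if p_expected == 1.0:
        return 0.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, used here as an estimate of PR-AUC (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    if total_pos == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_pos


# Toy example (hypothetical data, not from the paper).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.7, 0.1]

f1 = f1_score(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
ap = average_precision(y_true, scores)
```

PR-AUC and Cohen's Kappa are often preferred over accuracy for risk prediction because positive cases are typically rare in EHR cohorts, and both metrics penalize a model that simply predicts the majority class.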