SADI: Similarity-Aware Diffusion Model-Based Imputation for Incomplete Temporal EHR Data.

Author information

Dai Zongyu, Getzen Emily, Long Qi

Affiliation

University of Pennsylvania.

Publication information

Proc Mach Learn Res. 2024 May;238:4195-4203.

Abstract

Missing values are prevalent in temporal electronic health records (EHR) data and are known to complicate data analysis and lead to biased results. The current state-of-the-art (SOTA) models for imputing missing values in EHR primarily leverage correlations across time points and across features, which perform well when data have strong correlation across time points, such as in ICU data where high-frequency time series data are collected. However, this is often insufficient for temporal EHR data from non-ICU settings (e.g., outpatient visits for primary care or specialty care), where data are collected at substantially sparser time points, resulting in much weaker correlation across time points. To address this methodological gap, we propose the Similarity-Aware Diffusion Model-Based Imputation (SADI), a novel imputation method that leverages the diffusion model and utilizes information across dependent variables. We apply SADI to impute incomplete temporal EHR data and propose a similarity-aware denoising function, which includes a self-attention mechanism to model the correlations between time points, features, and similar patients. To the best of our knowledge, this is the first time that the information of similar patients is directly used to construct imputation for incomplete temporal EHR data. Our extensive experiments on two datasets, the Critical Path For Alzheimer's Disease (CPAD) data and the PhysioNet Challenge 2012 data, show that SADI outperforms the current SOTA under various missing data mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
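The three missing data mechanisms named in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' experimental protocol: the function name `make_missing_masks`, the two-column layout, the logistic form, and the slope of 2 are all hypothetical choices made for illustration. Column 0 is assumed to be always observed, and masks are generated for column 1 only.

```python
import numpy as np


def make_missing_masks(x, rate=0.3, seed=0):
    """Return boolean masks (True = missing) for column 1 of x.

    Hypothetical helper for illustration only. x is an (n, 2) array whose
    column 0 is assumed to be fully observed.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]

    def logistic_mask(driver):
        # Higher values of `driver` make a missing entry more likely.
        z = (driver - driver.mean()) / driver.std()
        p = 1.0 / (1.0 + np.exp(-2.0 * z))
        # Rescale so the average missingness probability is about `rate`.
        p = np.clip(p * rate / p.mean(), 0.0, 1.0)
        return rng.random(n) < p

    # MCAR: missingness is independent of every value in the data.
    mcar = rng.random(n) < rate
    # MAR: missingness depends only on the observed column 0.
    mar = logistic_mask(x[:, 0])
    # MNAR: missingness depends on the value that goes missing (column 1).
    mnar = logistic_mask(x[:, 1])
    return mcar, mar, mnar


if __name__ == "__main__":
    data = np.random.default_rng(1).normal(size=(5000, 2))
    for name, mask in zip(("MCAR", "MAR", "MNAR"), make_missing_masks(data)):
        print(name, "missing fraction:", round(float(mask.mean()), 3))
```

Under MNAR the probability of a value being missing depends on the value itself, so the observed entries are a biased sample of column 1. This is one reason the abstract evaluates imputation separately under each mechanism.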
