Gomez Louis Adedapo, Claassen Jan, Kleinberg Samantha
Stevens Institute of Technology, 1 Castle Point Terrace, Hoboken, 07030, NJ, USA.
Department of Neurology, Columbia University Irving Medical Center, 710 West 168th Street, New York, 10032, NY, USA; New York Presbyterian Hospital, Praveen Hospital Lane, New York, 10032, NY, USA.
J Biomed Inform. 2025 Jun;166:104828. doi: 10.1016/j.jbi.2025.104828. Epub 2025 Apr 22.
Healthcare data provide a unique opportunity to learn causal relationships, but the largest datasets, such as those from hospitals or intensive care units, are often observational and do not standardize the variables collected across patients. Instead, which variables are recorded depends on a patient's health status, treatment plan, and differences between providers. This poses major challenges for causal inference, which must either restrict analysis to patients with complete data (reducing power) or learn patient-specific models (making it difficult to generalize). While missing variables can lead to confounding, variables absent for one individual are often measured in another.
We propose a novel method, Causal Model Combination for Time Series (CMC-TS), to learn causal relationships from time series with partially overlapping variable sets. CMC-TS leverages the partial overlap between datasets (e.g., patients) to iteratively reconstruct missing variables, and corrects errors by reweighting inferences using information shared across datasets. We evaluated CMC-TS against the state of the art on both simulated data and real-world data from stroke patients admitted to a neurological intensive care unit.
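To make "partially overlapping variable sets" concrete, the following minimal sketch lists which variables each pair of datasets shares and, for each dataset, which other datasets measure the variables it lacks. It is not the CMC-TS algorithm, whose reconstruction and reweighting steps are not specified here; the patient and variable names and both helper functions are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical variable sets per patient (all names are illustrative only).
datasets = {
    "patient_a": {"heart_rate", "blood_pressure", "icp"},
    "patient_b": {"heart_rate", "icp", "temperature"},
    "patient_c": {"blood_pressure", "temperature"},
}


def pairwise_overlap(datasets):
    """Map each pair of dataset names to the set of variables they share."""
    return {
        (a, b): datasets[a] & datasets[b]
        for a, b in combinations(sorted(datasets), 2)
    }


def donors_for_missing(datasets):
    """For each dataset, map every variable it lacks to the datasets that measure it."""
    all_vars = set().union(*datasets.values())
    result = {}
    for name, measured in datasets.items():
        result[name] = {
            var: sorted(
                other
                for other, other_vars in datasets.items()
                if other != name and var in other_vars
            )
            for var in sorted(all_vars - measured)
        }
    return result


overlap = pairwise_overlap(datasets)
donors = donors_for_missing(datasets)

for pair, shared in overlap.items():
    print(pair, sorted(shared))
for name, missing in donors.items():
    print(name, missing)
```

In this toy setting every variable missing from one patient is measured in at least one other, which is the situation the abstract describes as exploitable.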
On simulated data, CMC-TS had the fewest false discoveries and highest F1-score compared to baselines. On real data from stroke patients in a neurological intensive care unit, we found fewer implausible and more highly ranked plausible causes of a clinically important adverse event.
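For readers unfamiliar with these metrics, the sketch below shows one common way to count false discoveries and compute an F1-score when inferred causal relationships are compared with a known ground truth, as is possible on simulated data. The edge sets are made up for illustration and are not results from the study; the function name `edge_metrics` is hypothetical, and the exact scoring protocol used in the paper is an assumption.

```python
def edge_metrics(inferred, truth):
    """Score inferred (cause, effect) pairs against ground-truth pairs."""
    tp = len(inferred & truth)   # correctly inferred relationships
    fp = len(inferred - truth)   # false discoveries
    fn = len(truth - inferred)   # true relationships that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    fdr = fp / (tp + fp) if tp + fp else 0.0  # false discovery rate
    return {"false_discoveries": fp, "fdr": fdr, "f1": f1}


# Illustrative edge sets only; not data from the paper.
truth = {("a", "b"), ("b", "c"), ("c", "d")}
inferred = {("a", "b"), ("b", "c"), ("a", "d")}

metrics = edge_metrics(inferred, truth)
print(metrics)
```

Here two of the three inferred edges are correct and one true edge is missed, so precision and recall are both 2/3 and the F1-score is 2/3.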
Our approach may lead to better use of observational healthcare data by enabling causal inference from patient data with partially overlapping variable sets.