Matsui Hiroki, Fushimi Kiyohide, Yasunaga Hideo
Department of Clinical Epidemiology and Health Economics, School of Public Health, The University of Tokyo, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, 1130033, Japan.
Department of Health Policy and Informatics, Institute of Science Tokyo Graduate School of Medical and Dental Sciences, 1-5-45 Yushima, Bunkyo-Ku, Tokyo, 1138519, Japan.
BMC Med Res Methodol. 2025 Apr 11;25(1):95. doi: 10.1186/s12874-025-02549-7.
Unmeasured confounders pose challenges when observational data are analysed in comparative effectiveness studies. Integrating high-dimensional administrative claims data may help adjust for unmeasured confounders. We examined whether distributed representations can compress high-dimensional administrative claims data to adjust for unmeasured confounders.
Using the Japanese Diagnosis Procedure Combination (DPC) database from 1291 hospitals (between April 2018 and March 2020), we applied the word2vec algorithm to create distributed representations for all medical codes. We focused on patients with heart failure (HF) and simulated four risk-adjustment models: 1, no adjustment; 2, adjusting for previously reported confounders; 3, adjusting for the sum of distributed representation weights of administrative claims data on the day of hospitalisation (novel method); and 4, a combination of models 2 and 3. We re-evaluated a previous study on the effect of early rehabilitation in patients with HF and compared these risk-adjustment methods (models 1-4).
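As an illustration of the Model 3 covariate (the sum of distributed representation weights of the codes recorded on the day of hospitalisation), the following minimal sketch sums per-code vectors into one patient-level vector. It is not the authors' implementation: the function name `sum_code_vectors`, the toy codes, and the three-dimensional vectors are hypothetical, and in practice the vectors would come from a trained word2vec model rather than a hand-written dictionary.

```python
from typing import Dict, List, Sequence


def sum_code_vectors(
    codes: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Sum the embedding vectors of all codes recorded for one patient.

    Codes without an embedding are skipped (an assumption of this sketch).
    """
    total = [0.0] * dim
    for code in codes:
        vec = embeddings.get(code)
        if vec is None:
            continue  # code not in the vocabulary
        for i, value in enumerate(vec):
            total[i] += value
    return total


# Hypothetical toy embeddings; real ones would be learned by word2vec.
toy_embeddings = {
    "I50.0": [0.5, -1.0, 0.25],
    "furosemide_iv": [0.25, 0.5, 0.0],
    "echo": [-0.5, 0.0, 1.0],
}

# Hypothetical codes recorded on the day of hospitalisation.
admission_codes = ["I50.0", "furosemide_iv", "echo", "unknown_code"]
patient_vector = sum_code_vectors(admission_codes, toy_embeddings, dim=3)
```

The resulting fixed-length vector could then enter a risk-adjustment model as a set of covariates, regardless of how many codes a patient has.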
Distributed representations were generated from the data of 15 998 963 in-patients, and 319 581 HF patients were identified. In the simulation study, Model 3 reduced the impact of unmeasured confounders and achieved better covariate balance than Model 1. Model 4 showed no increase in bias compared with the true model (Model 2) and was therefore used as the reference model in the real-world application. When applied to the previous study, Models 3 and 4 showed similar results.
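Covariate balance between treatment groups is often summarised with the standardized mean difference. The abstract does not state which balance metric was used, so the sketch below is only an assumption about one common choice; the function name and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def standardized_mean_difference(
    treated: Sequence[float],
    control: Sequence[float],
) -> float:
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0  # no variation in either group
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical covariate values in each group.
smd = standardized_mean_difference([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Values closer to zero indicate better balance on that covariate.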
Distributed representations can compress detailed administrative claims data and adjust for unmeasured confounders in comparative effectiveness studies.