Li Deyi, Shukla Aditi, Chandaka Sravani, Taylor Bradley, Xu Jie, Liu Mei
Department of Health Outcomes & Biomedical Informatics, University of Florida, 1889 Museum Rd, 7th Floor, Suite 7000, Room 7012, Gainesville, FL, 32611, United States, 1 352-627-9143.
Department of Mathematics, College of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, United States.
JMIR Med Inform. 2025 Jul 24;13:e68830. doi: 10.2196/68830.
BACKGROUND: By analyzing electronic health record snapshots of similar patients, physicians can proactively predict disease onsets, customize treatment plans, and anticipate patient-specific trajectories. However, modeling electronic health record data is inherently challenging because of its high dimensionality, mixed feature types, noise, bias, and sparsity. Patient representation learning using autoencoders (AEs) presents promising opportunities to address these challenges. A critical question remains: how do different AE designs and distance measures affect the quality of retrieved similar patient cohorts?

OBJECTIVE: This study aims to evaluate the performance of 5 common AE variants (vanilla autoencoder, denoising autoencoder, contractive autoencoder, sparse autoencoder, and robust autoencoder) in retrieving similar patients. Additionally, it investigates the impact of different distance measures and hyperparameter configurations on model performance.

METHODS: We tested the 5 AE variants on 2 real-world datasets, from the University of Kansas Medical Center (n=13,752) and the Medical College of Wisconsin (n=9568), across 168 different hyperparameter configurations. To retrieve similar patients based on the AE-produced latent representations, we applied k-nearest neighbors (k-NN) using Euclidean and Mahalanobis distances. Two prediction targets were evaluated: acute kidney injury onset and postdischarge 1-year mortality.

RESULTS: Our findings demonstrate that (1) denoising autoencoders outperformed other AE variants when paired with Euclidean distance (P<.001), followed by vanilla autoencoders and contractive autoencoders; (2) learning rates significantly influenced the performance of AE variants; and (3) Mahalanobis distance-based k-NN frequently outperformed Euclidean distance-based k-NN when applied to latent representations. However, whether AE models are superior in transforming raw data into latent representations, compared with applying Mahalanobis distance-based k-NN directly to raw data, appears to be data-dependent.

CONCLUSIONS: This study provides a comprehensive analysis of the performance of different AE variants in retrieving similar patients and evaluates the impact of various hyperparameter configurations on model performance. The findings lay the groundwork for future development of AE-based patient similarity estimation and personalized medicine.
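The retrieval step described in the methods (k-NN over latent vectors under Euclidean or Mahalanobis distance) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `knn_indices` and `knn_risk` are hypothetical, the latent vectors are assumed to be available already as a NumPy array, the covariance is assumed to be estimated from the candidate pool itself, a pseudo-inverse is used in case that covariance is singular, and the outcome estimate is assumed to be the plain mean of the neighbors' binary labels. None of these choices is stated in the abstract.

```python
import numpy as np


def knn_indices(query, pool, k, metric="euclidean"):
    """Return indices of the k rows of `pool` closest to `query`.

    `pool` is an (n, d) array of latent patient vectors and `query`
    is a (d,) vector. Both names are illustrative assumptions.
    """
    diff = pool - query  # (n, d) offsets from the query patient
    if metric == "euclidean":
        # Squared Euclidean distance; squaring preserves the ranking.
        d2 = np.einsum("ij,ij->i", diff, diff)
    elif metric == "mahalanobis":
        # Assumption: covariance is estimated from the pool itself.
        cov = np.atleast_2d(np.cov(pool, rowvar=False))
        # Pseudo-inverse guards against a singular covariance matrix.
        cov_inv = np.linalg.pinv(cov)
        # Squared Mahalanobis distance: diff^T * cov_inv * diff per row.
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(d2, kind="stable")[:k]


def knn_risk(query, pool, labels, k, metric="euclidean"):
    """Estimate outcome risk as the mean label of the k nearest patients."""
    idx = knn_indices(query, pool, k, metric)
    return float(np.mean(labels[idx]))
```

Because Mahalanobis distance rescales each direction by the pool's covariance, the two metrics can return different cohorts for the same query, which is one way the choice of distance measure could affect downstream prediction quality.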