Elhussein Ahmed, Hripcsak George
Department of Biomedical Informatics, Columbia University, NY, USA.
New York Genome Center, NY, USA.
medRxiv. 2025 Apr 1:2025.03.31.25324972. doi: 10.1101/2025.03.31.25324972.
The dimensionality of electronic health record (EHR) data continues to grow as more clinical variables are recorded, often resulting in redundancy, sparsity, and analytical intractability. In this study, we apply non-negative matrix factorization (NMF) to a high-dimensional laboratory dataset of patients with type II diabetes to estimate the minimum latent dimensionality required to preserve clinically meaningful information. Using both within-patient imputation and across-patient generalization tasks, we evaluate how well the learned representations reconstruct two key clinical lab values: blood glucose and HbA1c. Our findings show that clinically acceptable accuracy can be achieved at a latent dimensionality of 230 to 300, a reduction of up to 80% relative to the original feature space. This supports the presence of a compact, low-dimensional latent structure underlying high-dimensional clinical data.
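The idea of sweeping the latent dimensionality and checking how well held-out lab values are reconstructed can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' pipeline: the data are synthetic, the masked multiplicative-update NMF is one standard variant, and the names `masked_nmf` and `heldout_rmse`, the 20% hold-out fraction, and the candidate ranks are hypothetical.

```python
import numpy as np

def masked_nmf(X, mask, k, n_iter=300, seed=0, eps=1e-9):
    """Fit X ~ W @ H with rank k, using only entries where mask == 1.

    Uses weighted multiplicative updates for the Frobenius loss, which
    keep W and H non-negative when they start non-negative.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    MX = mask * X  # observed entries only; hidden entries contribute zero
    for _ in range(n_iter):
        WH = mask * (W @ H)
        W *= (MX @ H.T) / (WH @ H.T + eps)
        WH = mask * (W @ H)
        H *= (W.T @ MX) / (W.T @ WH + eps)
    return W, H

def heldout_rmse(X, mask, W, H):
    """Root-mean-square error on the entries hidden during fitting."""
    held = mask == 0
    return float(np.sqrt(np.mean((X - W @ H)[held] ** 2)))

# Synthetic stand-in for a patients-by-labs matrix (assumption: true rank 5).
rng = np.random.default_rng(42)
n_patients, n_labs, true_rank = 200, 50, 5
X = rng.random((n_patients, true_rank)) @ rng.random((true_rank, n_labs))

# Hide roughly 20% of entries to mimic an imputation task.
mask = (rng.random(X.shape) > 0.2).astype(float)

# Sweep candidate latent dimensionalities and record held-out error.
errors = {}
for k in (1, 5, 10):
    W, H = masked_nmf(X, mask, k)
    errors[k] = heldout_rmse(X, mask, W, H)

for k, err in errors.items():
    print(f"rank {k}: held-out RMSE = {err:.4f}")
```

In a sketch like this, the smallest rank whose held-out error falls under a chosen tolerance would serve as the estimate of the minimum latent dimensionality; what counts as a clinically acceptable tolerance for glucose or HbA1c is a domain decision that this example does not attempt to encode.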