University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, USA.
Medisyn Inc., Las Vegas, NV, USA.
Nat Commun. 2023 Aug 31;14(1):5305. doi: 10.1038/s41467-023-41093-0.
Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving offer an alternative to real EHRs for machine learning (ML) and statistical analysis. However, generating high-fidelity EHR data in its original, high-dimensional form poses challenges for existing methods. We propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHR data that preserve the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. HALO generates a probability density function over medical codes, clinical visits, and patient records, allowing realistic EHR data to be generated without variable selection or aggregation. Extensive experiments demonstrated that HALO can generate high-fidelity data whose high-dimensional disease code probabilities closely mirror those of real EHR data (R² above 0.9). HALO also enhances the accuracy of predictive modeling and enables downstream ML models to attain accuracy similar to that of models trained on genuine data.
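The fidelity figure quoted above compares per-code probabilities in real and synthetic data. The sketch below shows one way such a comparison could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not spell out the exact formula, so the coefficient of determination with the real data as reference is assumed here, and the function names (`code_prevalence`, `r_squared`) and the toy records are hypothetical.

```python
from collections import Counter


def code_prevalence(patients, vocabulary):
    """Fraction of patients whose record contains each code at least once.

    `patients` is a list of patient records; each record is a list of visits,
    and each visit is a set of medical codes (an assumed toy representation).
    """
    counts = Counter()
    for visits in patients:
        seen = set()
        for visit in visits:
            seen.update(visit)
        counts.update(seen)  # count each code once per patient
    n = len(patients)
    return [counts[code] / n for code in vocabulary]


def r_squared(reference, estimate):
    """Coefficient of determination of `estimate` against `reference`."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    if ss_tot == 0:
        raise ValueError("reference probabilities have zero variance")
    return 1.0 - ss_res / ss_tot


# Toy data, invented for illustration only.
vocabulary = ["A", "B", "C"]
real = [[{"A", "B"}], [{"A"}], [{"A"}, {"C"}], [{"B"}]]
synthetic = [[{"A"}], [{"A", "B"}], [{"A"}], [{"C"}]]

real_prev = code_prevalence(real, vocabulary)
synth_prev = code_prevalence(synthetic, vocabulary)
score = r_squared(real_prev, synth_prev)
print(real_prev, synth_prev, score)
```

With a real vocabulary of thousands of codes, a score near 1 would indicate that the synthetic cohort reproduces the marginal code frequencies of the real one. It says nothing by itself about temporal structure or privacy.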