Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model.

Authors

Brandon Theodorou, Cao Xiao, Jimeng Sun

Affiliations

University of Illinois Urbana-Champaign.

Relativity.

Publication information

Res Sq. 2023 Mar 10:rs.3.rs-2644725. doi: 10.21203/rs.3.rs-2644725/v1.

DOI:10.21203/rs.3.rs-2644725/v1

PMID:36945542

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10029081/

Abstract

Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, existing methods struggle to generate high-fidelity, granular EHR data in its original high-dimensional form because of the complexities inherent in high-dimensional data. In this paper, we propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHRs that preserve the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. Designed as a hierarchical autoregressive model, HALO generates a probability density function over medical codes, clinical visits, and patient records, which allows it to produce realistic EHR data in its original, unaggregated form without variable selection or aggregation. The model also produces high-quality continuous variables in a longitudinal and probabilistic manner. In extensive experiments, we demonstrate that HALO generates high-fidelity EHR data whose high-dimensional disease code probabilities (≈10,000), disease code co-occurrence probabilities within a visit (≈1,000,000), and conditional probabilities across consecutive visits (≈5,000,000) reach a correlation above 0.9 with real EHR data. Compared with the leading baseline, HALO improves predictive accuracy and perplexity by over 17% on a held-out test set of real EHR data. This performance enables downstream ML models trained on its synthetic data to achieve accuracy comparable to that of models trained on real data (0.938 area under the ROC curve with HALO data vs. 0.943 with real data). Finally, combining real and synthetic data improves the accuracy of ML models beyond that achieved with real EHR data alone.
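To make the fidelity statistics in the abstract more concrete, the following is a minimal sketch of how per-code probabilities and within-visit co-occurrence probabilities might be compared between real and synthetic records. It is not the authors' evaluation code. The toy records, the function names, the choice of visits as the denominator, and the use of Pearson correlation are all assumptions made for illustration; the abstract does not specify the exact statistic.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Assumed layout: a patient is a list of visits, a visit is a set of codes.
# These toy records are invented for illustration only.
real = [
    [{"I10", "E11"}, {"I10"}],
    [{"E11"}, {"E11", "N18"}],
    [{"I10", "N18"}],
]
synthetic = [
    [{"I10", "E11"}, {"I10", "N18"}],
    [{"E11"}, {"E11"}],
    [{"I10"}],
]

def all_visits(records):
    """Flatten patient records into a single list of visits."""
    return [visit for patient in records for visit in patient]

def code_probabilities(records):
    """Fraction of visits in which each code appears (assumed definition)."""
    visits = all_visits(records)
    counts = Counter(code for visit in visits for code in visit)
    return {code: n / len(visits) for code, n in counts.items()}

def cooccurrence_probabilities(records):
    """Fraction of visits in which each unordered code pair appears together."""
    visits = all_visits(records)
    counts = Counter(
        pair for visit in visits for pair in combinations(sorted(visit), 2)
    )
    return {pair: n / len(visits) for pair, n in counts.items()}

def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)

def correlation_between(real_probs, synth_probs):
    """Align two probability tables on the union of keys, then correlate."""
    keys = sorted(set(real_probs) | set(synth_probs))
    xs = [real_probs.get(k, 0.0) for k in keys]
    ys = [synth_probs.get(k, 0.0) for k in keys]
    return pearson(xs, ys)

unigram_r = correlation_between(
    code_probabilities(real), code_probabilities(synthetic)
)
real_pairs = cooccurrence_probabilities(real)
print(f"code-probability correlation: {unigram_r:.3f}")
```

With a real code vocabulary, the unigram table has on the order of 10,000 entries and the pair table on the order of 1,000,000, which is why a single summary statistic such as a correlation is a convenient way to report agreement.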
