Prasanna Anish, Jing Bocheng, Plopper George, Miller Kristina Krasnov, Sanjak Jaleal, Feng Alice, Prezek Sarah, Vidyaprakash Eshaw, Thovarai Vishal, Maier Ezekiel J, Bhattacharya Avik, Naaman Lama, Stephens Holly, Watford Sean, Boscardin W John, Johanson Elaine, Lienau Amanda
Booz Allen Hamilton.
Northern California Institute for Research and Education.
medRxiv. 2023 Dec 13:2023.12.11.23298687. doi: 10.1101/2023.12.11.23298687.
The COVID-19 pandemic disproportionately affected the Veteran population because of its higher prevalence of medical and environmental risk factors. Synthetic electronic health record (EHR) data can help meet the acute need for Veteran population-specific predictive modeling by avoiding the strict access barriers currently placed on Veterans Health Administration (VHA) datasets. The U.S. Food and Drug Administration (FDA) and the VHA launched the precisionFDA COVID-19 Risk Factor Modeling Challenge to develop COVID-19 diagnostic and prognostic models, identify Veteran population-specific risk factors, and test the usefulness of synthetic data as a substitute for real data. The use of synthetic data boosted challenge participation by providing a dataset that was accessible to all competitors. Models trained on synthetic data showed performance metrics that were similar to, but systematically inflated relative to, those of models trained on real data. The important risk factors identified in the synthetic data largely overlapped with those identified in the real data, and both sets of risk factors were validated against the literature. Synthetic data generation approaches involve tradeoffs that depend on whether a real EHR dataset is required as input: synthetic data generated directly from real EHR input will align more closely with the characteristics of the relevant cohort. This work shows that synthetic EHR data will have practical value to the Veterans' health research community for the foreseeable future.