The Harker School, San Jose, California.
AMIA Annu Symp Proc. 2022 Feb 21;2021:448-456. eCollection 2021.
Current COVID-19 predictive models focus primarily on mortality risk and rely on COVID-19-specific medical data, such as chest imaging obtained after COVID-19 diagnosis. In this project, we developed an innovative supervised machine learning pipeline that uses longitudinal Electronic Health Records (EHR) to accurately predict COVID-19-related health outcomes, including mortality, ventilation, and days in hospital or ICU. In particular, we developed unique and effective data processing algorithms covering data cleaning, initial feature screening, and vector representation. We then trained models using state-of-the-art machine learning strategies combined with different parameter settings. Based on routinely collected EHR, our machine learning pipeline not only consistently outperformed those developed by other research groups on the same data set, but also achieved accuracy similar to that of models trained on medical data available only after COVID-19 diagnosis. In addition, the top risk factors for COVID-19 that we identified are consistent with epidemiologic findings.
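The abstract names three data processing steps (data cleaning, initial feature screening, vector representation) without detailing them. The following is only a minimal sketch of what such steps could look like, not the authors' method: the event records, the rule of dropping records with a missing code, the `min_patients` prevalence threshold, and the count-based vectors are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical longitudinal EHR events as (patient_id, condition_code).
# These records are invented for illustration; they are not study data.
events = [
    ("p1", "diabetes"), ("p1", "hypertension"), ("p1", "diabetes"),
    ("p2", "hypertension"), ("p2", "asthma"),
    ("p3", "hypertension"), ("p3", "diabetes"), ("p3", None),
]

def clean(events):
    """Data cleaning (assumed rule): drop records whose code is missing."""
    return [(pid, code) for pid, code in events if code is not None]

def screen_features(events, min_patients=2):
    """Initial feature screening (assumed rule): keep codes that occur
    in at least min_patients distinct patients."""
    patients_per_code = defaultdict(set)
    for pid, code in events:
        patients_per_code[code].add(pid)
    return sorted(c for c, pids in patients_per_code.items()
                  if len(pids) >= min_patients)

def vectorize(events, features):
    """Vector representation (assumed form): for each patient, count how
    often each retained code appears in that patient's history."""
    counts = defaultdict(Counter)
    for pid, code in events:
        counts[pid][code] += 1
    return {pid: [counts[pid][f] for f in features] for pid in sorted(counts)}

cleaned = clean(events)
features = screen_features(cleaned)
vectors = vectorize(cleaned, features)
```

In this toy example, `asthma` is screened out because it appears in only one patient, and each patient ends up with a fixed-length count vector that a supervised model could take as input.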