Bennett Tellen D, Moffitt Richard A, Hajagos Janos G, Amor Benjamin, Anand Adit, Bissell Mark M, Bradwell Katie Rebecca, Bremer Carolyn, Byrd James Brian, Denham Alina, DeWitt Peter E, Gabriel Davera, Garibaldi Brian T, Girvin Andrew T, Guinney Justin, Hill Elaine L, Hong Stephanie S, Jimenez Hunter, Kavuluru Ramakanth, Kostka Kristin, Lehmann Harold P, Levitt Eli, Mallipattu Sandeep K, Manna Amin, McMurry Julie A, Morris Michele, Muschelli John, Neumann Andrew J, Palchuk Matvey B, Pfaff Emily R, Qian Zhenglong, Qureshi Nabeel, Russell Seth, Spratt Heidi, Walden Anita, Williams Andrew E, Wooldridge Jacob T, Yoo Yun Jae, Zhang Xiaohan Tanner, Zhu Richard L, Austin Christopher P, Saltz Joel H, Gersing Ken R, Haendel Melissa A, Chute Christopher G
Section of Informatics and Data Science, Department of Pediatrics, University of Colorado School of Medicine, University of Colorado, Aurora, CO, USA.
Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA.
medRxiv. 2021 Jan 23:2021.01.12.21249511. doi: 10.1101/2021.01.12.21249511.
The majority of U.S. reports of COVID-19 clinical characteristics, disease course, and treatments are from single health systems or focused on one domain. Here we report the creation of the National COVID Cohort Collaborative (N3C), a centralized, harmonized, high-granularity electronic health record repository that is the largest, most representative U.S. cohort of COVID-19 cases and controls to date. This multi-center dataset supports robust evidence-based development of predictive and diagnostic tools and informs critical care and policy.
In a retrospective cohort study of 1,926,526 patients from 34 medical centers nationwide, we stratified patients using a World Health Organization COVID-19 severity scale and demographics; we then evaluated differences between groups over time using multivariable logistic regression. We established vital signs and laboratory values among COVID-19 patients with different severities, providing the foundation for predictive analytics. The cohort included 174,568 adults with severe acute respiratory syndrome associated with SARS-CoV-2 (PCR >99% or antigen <1%) as well as 1,133,848 adult patients who served as lab-negative controls. Among 32,472 hospitalized patients, mortality was 11.6% overall and decreased from 16.4% in March/April 2020 to 8.6% in September/October 2020 (p = 0.002 for monthly trend). In a multivariable logistic regression model, age, male sex, liver disease, dementia, African-American and Asian race, and obesity were independently associated with higher clinical severity. To demonstrate the utility of the N3C cohort for analytics, we used machine learning (ML) to predict clinical severity and risk factors over time. Using 64 inputs available on the first hospital day, we predicted a severe clinical course (death, discharge to hospice, invasive ventilation, or extracorporeal membrane oxygenation) using random forest and XGBoost models (AUROC 0.86 and 0.87, respectively) that were stable over time. The most powerful predictors in these models are patient age and widely available vital sign and laboratory values. The established expected trajectories for many vital signs and laboratory values among patients with different clinical severities validate observations from smaller studies and provide comprehensive insight into COVID-19 characterization in U.S. patients.
This is the first description of an ongoing longitudinal observational study of patients seen in diverse clinical settings and geographical regions and is the largest COVID-19 cohort in the United States. Such data are the foundation for ML models that can be the basis for generalizable clinical decision support tools. The N3C Data Enclave is unique in providing transparent, reproducible, easily shared, versioned, and fully auditable data and analytic provenance for national-scale patient-level EHR data. The N3C is built for intensive ML analyses by academic, industry, and citizen scientists internationally. Many observational correlations can inform trial designs and care guidelines for this new disease.
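The AUROC values reported for the random forest and XGBoost severity models can be read as the probability that a randomly chosen patient with a severe course receives a higher predicted risk than a randomly chosen patient without one. The following is a minimal sketch of that pairwise definition only, not the study's evaluation code; the function name `auroc` and the toy labels and scores are hypothetical and are not N3C data.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.

    labels: iterable of 0/1 outcomes (1 = severe clinical course).
    scores: iterable of predicted risks, same length as labels.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example (illustrative values only, not study data).
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8, 0.3, 0.4]
toy_auroc = auroc(toy_labels, toy_scores)
print(f"Toy AUROC: {toy_auroc:.3f}")
```

The quadratic pairwise loop is chosen for readability; for cohorts of the size described here, a rank-based computation or an established library routine would be the practical choice.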