Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA.
J Biomed Inform. 2021 May;117:103777. doi: 10.1016/j.jbi.2021.103777. Epub 2021 Apr 8.
From the start of the coronavirus disease 2019 (COVID-19) pandemic, researchers have looked to electronic health record (EHR) data as a way to study possible risk factors and outcomes. To ensure the validity and accuracy of research using these data, investigators need to be confident that the phenotypes they construct are reliable and accurate, reflecting the healthcare settings from which they are ascertained. We developed a COVID-19 registry at a single academic medical center and used data from March 1 to June 5, 2020 to assess differences in population-level characteristics between pandemic and non-pandemic years. Median EHR length, previously shown to impact phenotype performance in type 2 diabetes, was significantly shorter in the SARS-CoV-2 positive group than in a 2019 influenza tested group (median 3.1 years vs 8.7 years; Wilcoxon rank sum P = 1.3e-52). Using three phenotyping methods of increasing complexity (billing codes alone, and domain-specific algorithms provided by an EHR vendor and by clinical experts), common medical comorbidities were abstracted from COVID-19 EHRs, which were defined by the presence of a positive laboratory test (positive predictive value 100%, recall 93%). After combining performance data across phenotyping methods, we observed significantly lower false negative rates for records billed for a comprehensive care visit (P = 4e-11) and for records with complete demographics data (P = 7e-5). In an early COVID-19 cohort, we found that phenotyping performance for nine common comorbidities was influenced by median EHR length, consistent with previous studies, as well as by data density, which can be measured using portable metrics including CPT codes. Here we present these challenges and potential solutions for creating deeply phenotyped, acute COVID-19 cohorts.
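The performance measures named in the abstract (positive predictive value, recall, and false negative rate) can be illustrated with a minimal Python sketch. It assumes each record carries a phenotype call and a chart-review reference label. The function name `phenotype_metrics`, the `PhenotypeMetrics` container, and the six toy records are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PhenotypeMetrics:
    """Hypothetical container for the three measures named in the abstract."""
    ppv: float
    recall: float
    false_negative_rate: float


def phenotype_metrics(predicted: Sequence[bool],
                      truth: Sequence[bool]) -> PhenotypeMetrics:
    """Compare phenotype calls against reference labels, record by record."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)

    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("metrics are undefined without predicted and true cases")

    return PhenotypeMetrics(
        ppv=tp / (tp + fp),                  # fraction of calls that are correct
        recall=tp / (tp + fn),               # fraction of true cases that are found
        false_negative_rate=fn / (tp + fn),  # equals 1 - recall
    )


# Toy data only (assumption): six records, one true case missed by the algorithm.
toy_predicted = [True, True, True, False, False, False]
toy_truth = [True, True, True, True, False, False]
toy_result = phenotype_metrics(toy_predicted, toy_truth)
print(toy_result)
```

Under this framing, a missed comorbidity in a sparse record raises the false negative rate without affecting positive predictive value, which is one way to read the abstract's link between data density and false negatives.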