Division of Cardiovascular Medicine, Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT.
Division of Epidemiology, Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT.
J Am Heart Assoc. 2020 Mar 3;9(5):e014527. doi: 10.1161/JAHA.119.014527. Epub 2020 Feb 26.
Background Electronic medical records (EMRs) allow identification of disease-specific patient populations, but varying electronic cohort definitions could result in different populations. We compared the characteristics of an electronic medical record-derived atrial fibrillation (AF) patient population using 5 different electronic cohort definitions.
Methods and Results Adult patients with at least 1 AF billing code from January 1, 2010, to December 31, 2017, were included. Based on different electronic cohort definitions, we trained 5 different logistic regression models using a labeled training data set (n=786). Each model yielded a predicted probability; patients were classified as having AF if the probability was higher than a specified cut point. Test characteristics were calculated for each model. These models were then applied to the full cohort and resulting characteristics were compared. In the training set, the comprehensive model (including demographics, billing codes, and natural language processing results) performed best, with an area under the curve of 0.89, sensitivity of 0.90, and specificity of 0.87. Among a candidate population (n=22 000), the proportion of patients identified as having AF varied from 61% in the model using diagnosis or procedure billing codes to 83% in the model using natural language processing of clinical notes. Among identified AF patients, the proportion of patients with a CHA2DS2-VASc score ≥2 varied from 69% to 85%; oral anticoagulant treatment rates varied from 50% to 66% depending on the model.
Conclusions Different electronic cohort definitions result in substantially different AF study samples. This difference threatens the quality and reproducibility of electronic medical record-based research and quality initiatives.
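The cut-point classification and test characteristics described in the Methods can be illustrated with a minimal sketch. The function names, the 0.5 cut point, and the sample probabilities and labels below are illustrative assumptions; they are not the study's code, threshold, or data.

```python
from typing import List, Sequence, Tuple


def classify(probabilities: Sequence[float], cut_point: float) -> List[bool]:
    """Label a patient as AF (True) when the predicted probability
    is strictly higher than the cut point."""
    return [p > cut_point for p in probabilities]


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) against reference labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities and chart-review labels (made up).
probs = [0.92, 0.81, 0.40, 0.65, 0.10, 0.55, 0.30, 0.75]
labels = [True, True, True, True, False, False, False, False]

predicted = classify(probs, cut_point=0.5)  # 0.5 is an assumed cut point
sens, spec = sensitivity_specificity(predicted, labels)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising or lowering the cut point trades sensitivity against specificity, which is one reason the same candidate population can yield different cohorts under different definitions.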