Brigham and Women's Hospital, Boston, Massachusetts, USA.
Arthritis Care Res (Hoboken). 2010 Aug;62(8):1120-7. doi: 10.1002/acr.20184.
Electronic medical records (EMRs) are a rich data source for discovery research but are underutilized due to the difficulty of extracting highly accurate clinical data. We assessed whether a classification algorithm incorporating narrative EMR data (typed physician notes) more accurately classifies subjects with rheumatoid arthritis (RA) compared with an algorithm using codified EMR data alone.
Subjects with ≥1 International Classification of Diseases, Ninth Revision RA code (714.xx) or who had anti-cyclic citrullinated peptide (anti-CCP) checked in the EMR of 2 large academic centers were included in an "RA Mart" (n = 29,432). For all 29,432 subjects, we extracted narrative (using natural language processing) and codified RA clinical information. In a training set of 96 RA and 404 non-RA cases from the RA Mart classified by medical record review, we used narrative and codified data to develop classification algorithms using logistic regression. These algorithms were applied to the entire RA Mart. We calculated and compared the positive predictive value (PPV) of these algorithms by reviewing the records of an additional 400 subjects classified as having RA by the algorithms.
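As an illustration of the validation step, the PPV of an algorithm is the fraction of algorithm-positive subjects whom record review confirms as true cases. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `positive_predictive_value` and the toy review outcomes are hypothetical and are not data from the study.

```python
from typing import Sequence


def positive_predictive_value(confirmed: Sequence[bool]) -> float:
    """Return PPV = confirmed cases / all algorithm-positive subjects reviewed.

    Each entry is the record-review outcome for one subject whom the
    algorithm classified as having RA (True = reviewer confirmed RA).
    """
    if not confirmed:
        raise ValueError("need at least one reviewed subject")
    true_positives = sum(1 for outcome in confirmed if outcome)
    return true_positives / len(confirmed)


# Hypothetical review outcomes for 10 algorithm-positive subjects.
# Illustrative values only; these are NOT results from the study.
toy_review = [True] * 9 + [False]
toy_ppv = positive_predictive_value(toy_review)
```

Because only algorithm-positive subjects are reviewed, this design estimates PPV directly but does not by itself yield sensitivity or specificity.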
A complete algorithm (narrative and codified data) classified RA subjects with a significantly higher PPV of 94% than an algorithm with codified data alone (PPV of 88%). Characteristics of the RA cohort identified by the complete algorithm were comparable to existing RA cohorts (80% women, 63% anti-CCP positive, and 59% positive for erosions).
We demonstrate the ability to utilize complete EMR data to define an RA cohort with a PPV of 94%, which was superior to an algorithm using codified data alone.