Gastrointestinal Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.
Inflamm Bowel Dis. 2013 Jun;19(7):1411-20. doi: 10.1097/MIB.0b013e31828133fd.
Previous studies identifying patients with inflammatory bowel disease using administrative codes have yielded inconsistent results. Our objective was to develop a robust electronic medical record-based model for classification of inflammatory bowel disease leveraging the combination of codified data and information from clinical text notes using natural language processing.
Using the electronic medical records of 2 large academic centers, we created data marts for Crohn's disease (CD) and ulcerative colitis (UC) comprising patients with ≥1 International Classification of Diseases, 9th edition, code for each disease. We used codified data (i.e., International Classification of Diseases, 9th edition codes and electronic prescriptions) and narrative data from clinical notes to develop our classification model. Model development and validation were performed in a training set of 600 randomly selected patients for each disease, with medical record review as the gold standard. Logistic regression with the adaptive LASSO penalty was used to select informative variables.
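The variable-selection step described above can be illustrated with a minimal sketch of adaptive LASSO logistic regression. It is not the authors' implementation: the abstract does not report the software, tuning parameters, or feature set. The data are synthetic, and the helper names (`fit_logistic`, `adaptive_lasso`, `make_data`) and the values of `lam`, `gamma`, `ridge`, `step`, and `iters` are assumptions chosen for illustration only. The sketch fits a lightly ridge-penalized model first, turns the initial coefficients into per-variable penalty weights, and then refits with a weighted L1 penalty by proximal gradient descent.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def fit_logistic(X, y, l1_weights, lam=0.0, ridge=0.0, step=1.0, iters=300):
    """Proximal gradient descent on
    mean log-loss + (ridge / 2) * sum(b_j^2) + lam * sum(w_j * |b_j|).
    The intercept b0 is left unpenalized."""
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(iters):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            # Residual: predicted probability minus observed label.
            r = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) - yi
            g0 += r / n
            for j in range(p):
                g[j] += r * xi[j] / n
        b0 -= step * g0
        b = [soft_threshold(b[j] - step * (g[j] + ridge * b[j]),
                            step * lam * l1_weights[j]) for j in range(p)]
    return b0, b


def adaptive_lasso(X, y, lam=0.1, gamma=1.0):
    """Two-stage fit: initial ridge estimate, then weighted L1 refit."""
    p = len(X[0])
    # Stage 1: initial estimate (hypothetical choice: light ridge penalty).
    _, b_init = fit_logistic(X, y, [0.0] * p, ridge=0.01)
    # Stage 2: variables with small initial coefficients get large penalties.
    weights = [1.0 / (abs(bj) ** gamma + 1e-8) for bj in b_init]
    return fit_logistic(X, y, weights, lam=lam)


def make_data(n=200, p=6, seed=0):
    """Synthetic data (not study data): only features 0 and 1 are informative."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        xi = [rng.gauss(0.0, 1.0) for _ in range(p)]
        prob = sigmoid(2.0 * xi[0] - 2.0 * xi[1])
        X.append(xi)
        y.append(1 if rng.random() < prob else 0)
    return X, y


X, y = make_data()
intercept, coefs = adaptive_lasso(X, y)
print([round(c, 3) for c in coefs])
```

In a setting like the one described, the columns of `X` would be counts of codes, prescriptions, and narrative concept mentions per patient, and `y` the chart-review label. Because the penalty weight grows as the initial coefficient shrinks, weakly associated variables tend to be driven to exactly zero while strongly associated ones are penalized less.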
We confirmed 399 CD cases (67%) in the CD training set and 378 UC cases (63%) in the UC training set. For both, a combined model including narrative and codified data had better accuracy (area under the curve for CD 0.95; UC 0.94) than models using only disease International Classification of Diseases, 9th edition codes (area under the curve 0.89 for CD; 0.86 for UC). Addition of natural language processing narrative terms to our final model resulted in classification of 6% to 12% more subjects with the same accuracy.
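For readers less familiar with the accuracy metric reported above, the area under the ROC curve can be computed as the probability that a randomly chosen confirmed case receives a higher model score than a randomly chosen non-case, with ties counted as one half. The following is a minimal sketch of that calculation; the function name `auc` and the toy scores and labels are invented for illustration and are not data from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and non-cases.

    scores: model-predicted scores (higher means more likely a case).
    labels: 1 for a chart-confirmed case, 0 otherwise.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    # Count case/non-case pairs ranked correctly; a tie counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 3 of the 4 case/non-case pairs are ordered correctly.
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [1, 1, 0, 0]
print(auc(toy_scores, toy_labels))  # 0.75
```

Under this interpretation, an area under the curve of 0.95 means that in about 95% of case/non-case pairs the model scores the confirmed case higher.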
Inclusion of narrative concepts identified using natural language processing improves the accuracy of electronic medical records case definition for CD and UC while simultaneously identifying more subjects compared with models using codified data alone.