Am J Epidemiol. 2014 Mar 15;179(6):749-58. doi: 10.1093/aje/kwt441. Epub 2014 Jan 30.
The increasing availability of electronic health records (EHRs) creates opportunities for automated extraction of information from clinical text. We hypothesized that natural language processing (NLP) could substantially reduce the burden of manual abstraction in studies examining outcomes, like cancer recurrence, that are documented in unstructured clinical text, such as progress notes, radiology reports, and pathology reports. We developed an NLP-based system using open-source software to process electronic clinical notes from 1995 to 2012 for women with early-stage incident breast cancers to identify whether and when recurrences were diagnosed. We developed and evaluated the system using clinical notes from 1,472 patients receiving EHR-documented care in an integrated health care system in the Pacific Northwest. A separate study provided the patient-level reference standard for recurrence status and date. The NLP-based system correctly identified 92% of recurrences and estimated diagnosis dates within 30 days for 88% of these. Specificity was 96%. The NLP-based system overlooked 5 of 65 recurrences, 4 because electronic documents were unavailable. The NLP-based system identified 5 other recurrences incorrectly classified as nonrecurrent in the reference standard. If used in similar cohorts, NLP could reduce by 90% the number of EHR charts abstracted to identify confirmed breast cancer recurrence cases at a rate comparable to traditional abstraction.
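The accuracy figures reported above follow from simple count ratios, which can be sketched as below. This is only an illustrative sketch, not the study's evaluation code: the function names and the 30-day window helper are assumptions, and the counts passed to `specificity` are hypothetical placeholders, because the abstract does not report the number of nonrecurrent patients in the evaluation set. Only the 65 recurrences and 5 missed cases come from the abstract.

```python
from datetime import date


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of reference-standard recurrences that the system detected."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Share of reference-standard nonrecurrent patients classified as nonrecurrent."""
    return true_negatives / (true_negatives + false_positives)


def within_window(estimated: date, reference: date, window_days: int = 30) -> bool:
    """True when the estimated diagnosis date is within window_days of the reference date."""
    return abs((estimated - reference).days) <= window_days


# Counts taken from the abstract: 65 recurrences, 5 of them overlooked.
recurrences = 65
missed = 5
detected = recurrences - missed

sens = sensitivity(detected, missed)
print(f"Sensitivity: {sens:.1%}")  # Sensitivity: 92.3%

# HYPOTHETICAL counts, chosen only to show the calculation; not from the study.
example_spec = specificity(true_negatives=960, false_positives=40)

# HYPOTHETICAL dates illustrating the 30-day agreement criterion.
example_match = within_window(date(2010, 3, 1), date(2010, 3, 20))
```

The 60 of 65 detected recurrences give 92.3%, which matches the 92% reported in the abstract after rounding.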