Department of Medicine, Stanford University School of Medicine, Stanford, CA.
Department of Epidemiology and Population Health, Stanford University School of Medicine, Stanford, CA.
JCO Clin Cancer Inform. 2021 Apr;5:469-478. doi: 10.1200/CCI.20.00165.
Large-scale analysis of real-world evidence is often limited to structured data fields that do not contain reliable information on recurrence status and disease sites. In this report, we describe a natural language processing (NLP) framework that uses data from free-text, unstructured reports to classify recurrence status and sites of recurrence for patients with breast and hepatocellular carcinomas (HCC).
Using two cohorts of breast cancer and HCC cases, we validated a previously developed NLP model against manual curation for its ability to distinguish among no recurrence, local recurrence, and distant recurrence on the basis of clinician notes, radiology reports, and pathology reports. A second NLP model was trained and validated to identify sites of recurrence. We assessed how well each NLP model identified the presence, timing, and site of recurrence relative to manual chart review and International Classification of Diseases coding.
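As an illustration of validating three-class recurrence labels against manual curation, the following minimal sketch tabulates agreement and computes one-vs-rest sensitivity and specificity. The label names, the helper functions, and the toy data are assumptions made for this example only; they are not taken from the study.

```python
from collections import Counter

# Hypothetical label names for the three recurrence classes.
CLASSES = ("none", "local", "distant")

def confusion(manual, predicted):
    """Count (manual label, predicted label) pairs."""
    return Counter(zip(manual, predicted))

def sensitivity_specificity(manual, predicted, target):
    """One-vs-rest sensitivity and specificity for a single class."""
    pairs = list(zip(manual, predicted))
    tp = sum(m == target and p == target for m, p in pairs)
    fn = sum(m == target and p != target for m, p in pairs)
    tn = sum(m != target and p != target for m, p in pairs)
    fp = sum(m != target and p == target for m, p in pairs)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

# Toy data (invented): manual curation is treated as the reference standard.
manual = ["none", "none", "local", "distant", "distant", "none", "local", "distant"]
nlp = ["none", "local", "local", "distant", "none", "none", "local", "distant"]

table = confusion(manual, nlp)
for cls in CLASSES:
    sens, spec = sensitivity_specificity(manual, nlp, cls)
    print(f"{cls}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Treating each class one-vs-rest keeps the metrics comparable across classes with different prevalence, which matters when distant recurrence is relatively rare.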
A total of 1,273 patients were included in the development and validation of the two models. The NLP model for recurrence detects distant recurrence with an area under the curve of 0.98 (95% CI, 0.96 to 0.99) and 0.95 (95% CI, 0.88 to 0.98) in breast and HCC cohorts, respectively. The mean accuracy of the NLP model for detecting any site of distant recurrence was 0.9 for breast cancer and 0.83 for HCC. The NLP model for recurrence identified a larger proportion of patients with distant recurrence in a breast cancer database (11.1%) compared with International Classification of Diseases coding (2.31%).
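For readers who want to reproduce this kind of summary on their own data, the sketch below computes an area under the ROC curve and a percentile bootstrap 95% CI for a binary "distant recurrence versus not" task. The abstract does not state how the intervals were obtained, so the bootstrap is an assumption, as are the function names and the toy scores.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC (assumed method)."""
    auc(labels, scores)  # validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample drew a single class; AUC is undefined, so redraw
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy data (invented): 1 = distant recurrence per manual review, score = model probability.
labels = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.30, 0.70, 0.05, 0.60]

point = auc(labels, scores)
lower, upper = bootstrap_ci(labels, scores)
print(f"AUC = {point:.2f} (95% CI, {lower:.2f} to {upper:.2f})")
```

With cohorts of realistic size the interval narrows considerably; on a ten-case toy set like this one it is wide and serves only to show the mechanics.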
We developed two NLP models to identify distant cancer recurrence, timing of recurrence, and sites of recurrence based on unstructured electronic health record data. These models can be used to perform large-scale retrospective studies in oncology.