Dana-Farber Cancer Institute, Boston, MA.
Harvard Medical School, Boston, MA.
JCO Clin Cancer Inform. 2020 Aug;4:680-690. doi: 10.1200/CCI.20.00020.
Cancer research using electronic health records and genomic data sets requires clinical outcomes data, which may be recorded only in unstructured text by treating oncologists. Natural language processing (NLP) could substantially accelerate extraction of this information.
Patients with lung cancer who had tumor sequencing as part of a single-institution precision oncology study from 2013 to 2018 were identified. Medical oncologists' progress notes for these patients were reviewed. For each note, curators recorded whether the assessment/plan indicated any cancer, progression/worsening of disease, and/or response to therapy or improving disease. Next, a recurrent neural network was trained using unlabeled notes to extract the assessment/plan from each note. Finally, convolutional neural networks were trained on labeled assessments/plans to predict the probability that each curated outcome was present. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC) among a held-out test set of 10% of patients. Associations between curated response or progression end points and overall survival were measured using Cox models among patients receiving palliative-intent systemic therapy.
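The evaluation design described above (a held-out test set defined at the patient level, scored by AUROC) can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_patients` and `auroc`, the random seed, and the toy inputs are all assumptions made for illustration, and the abstract does not say how the 10% of patients were selected.

```python
import random


def split_patients(patient_ids, test_fraction=0.10, seed=0):
    """Hold out a fraction of patients (not notes).

    Splitting by patient keeps all notes from one patient on the same
    side of the split, so test notes never share a patient with training
    notes. The seed and rounding rule are illustrative assumptions.
    """
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(test_fraction * len(unique_ids)))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids


def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive note
    receives a higher score than a randomly chosen negative note
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: one patient ID per note (hypothetical values).
    note_patient_ids = [i % 100 for i in range(800)]
    train_ids, test_ids = split_patients(note_patient_ids)
    print(len(train_ids), len(test_ids))

    # Toy curated labels and model probabilities for four test notes.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUROC here is quadratic in the number of notes, which is acceptable for a sketch; a rank-based computation or an established library routine would be the usual choice at the scale of several thousand notes.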
Medical oncologist notes (n = 7,597) were manually curated for 919 patients. In the 10% test set, NLP models replicated human curation with AUROCs of 0.94 for the any-cancer outcome, 0.86 for the progression outcome, and 0.90 for the response outcome. Progression/worsening events identified using NLP models were associated with shortened survival (hazard ratio [HR] for mortality, 2.49; 95% CI, 2.00 to 3.09); response/improvement events were associated with improved survival (HR, 0.45; 95% CI, 0.30 to 0.67).
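As a reading aid for the hazard ratios above, the relationship between a Cox coefficient and its reported interval can be written out. This assumes the intervals are standard Wald intervals on the log-hazard scale, which the abstract does not state; the standard errors below are back-calculated from the published bounds and are approximations, not reported values.

```latex
\[
\mathrm{HR} = e^{\hat\beta},
\qquad
95\%\ \mathrm{CI} = \exp\!\left(\hat\beta \pm 1.96\,\widehat{\mathrm{SE}}(\hat\beta)\right)
\]
% Progression/worsening: HR 2.49 (2.00 to 3.09)
\[
\hat\beta = \ln 2.49 \approx 0.912,
\qquad
\widehat{\mathrm{SE}} \approx \frac{\ln 3.09 - \ln 2.00}{2 \times 1.96} \approx 0.111
\]
% Response/improvement: HR 0.45 (0.30 to 0.67)
\[
\hat\beta = \ln 0.45 \approx -0.799,
\qquad
\widehat{\mathrm{SE}} \approx \frac{\ln 0.67 - \ln 0.30}{2 \times 1.96} \approx 0.205
\]
```

Under this assumption each point estimate should sit near the geometric mean of its bounds, and it does: $\sqrt{2.00 \times 3.09} \approx 2.49$ and $\sqrt{0.30 \times 0.67} \approx 0.45$.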
NLP models based on neural networks can extract meaningful outcomes from oncologist notes at scale. Such models may facilitate identification of clinical and genomic features associated with response to cancer treatment.