Department of Management Science and Engineering, Huang Engineering Center, Stanford, CA.
Department of Biomedical Informatics, Department of Radiology, Emory University School of Medicine, Atlanta, GA.
JCO Clin Cancer Inform. 2021 Apr;5:379-393. doi: 10.1200/CCI.20.00173.
Knowing the treatments administered to patients with cancer is important for treatment planning and for correlating treatment patterns with outcomes in personalized medicine research. However, existing methods to identify treatments are often lacking. We developed a natural language processing approach that uses structured electronic medical records and unstructured clinical notes to identify the initial treatment administered to patients with cancer.
We used data from 4,412 patients with 483,782 clinical notes from the Stanford Cancer Institute Research Database; all patients had nonmetastatic prostate, oropharynx, or esophagus cancer. We trained treatment identification models for each cancer type separately and compared the performance of using only structured data, only unstructured data (, , ), and combinations of both (, , ). We selected the best identification model among five machine learning methods (logistic regression, multilayer perceptrons, random forest, support vector machines, and stochastic gradient boosting). We treated the treatment information recorded in the cancer registry as the gold standard and compared our methods with a baseline that identifies treatments from billing codes.
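The three input settings compared above (structured only, unstructured only, and both combined) can be illustrated with a minimal sketch. The text representations used in the study are not legible in this record, so the bag-of-words counts below are a stand-in assumption, and the vocabulary, the structured indicators, and the function names are all hypothetical.

```python
from collections import Counter


def bag_of_words(note, vocabulary):
    """Count how often each vocabulary term occurs in a lower-cased note."""
    counts = Counter(note.lower().split())
    return [counts[term] for term in vocabulary]


def build_features(structured, note, vocabulary, mode):
    """Return one patient's feature vector under a given input setting.

    mode is "structured", "unstructured", or "combined".
    """
    text = bag_of_words(note, vocabulary)
    if mode == "structured":
        return list(structured)
    if mode == "unstructured":
        return text
    if mode == "combined":
        # Concatenate structured indicators with the text features.
        return list(structured) + text
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical toy patient: two structured indicators and one short note.
vocabulary = ["prostatectomy", "radiation", "chemotherapy"]
structured = [1, 0]  # assumed flags: surgical code present, radiation code present
note = "Patient underwent radical prostatectomy without adjuvant radiation"

for mode in ("structured", "unstructured", "combined"):
    print(mode, build_features(structured, note, vocabulary, mode))
```

Each setting yields a different feature vector for the same patient, and any of the five classifiers listed above could then be trained on the result. Note that simple word counts ignore negation ("without adjuvant radiation" still counts "radiation"), which is one reason a learned model over the text may be preferable to keyword rules.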
For prostate cancer, we achieved an f1-score of 0.99 (95% CI, 0.97 to 1.00) for radiation and 1.00 (95% CI, 0.99 to 1.00) for surgery using . For oropharynx cancer, we achieved an f1-score of 0.78 (95% CI, 0.58 to 0.93) for chemoradiation and 0.83 (95% CI, 0.69 to 0.95) for surgery using . For esophagus cancer, we achieved an f1-score of 1.0 (95% CI, 1.0 to 1.0) for both chemoradiation and surgery using all combinations of structured and unstructured data. We found that employing the free-text clinical notes outperforms using the billing codes or only structured data for all three cancer types.
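As a worked illustration of how an f1-score with a 95% CI might be computed against registry labels, here is a minimal sketch. The abstract does not state how the intervals were obtained, so the percentile bootstrap used here is an assumption, and the labels are toy data rather than study results.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the f1-score (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample patients with replacement, keeping label pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy labels: 1 = treatment received per registry (gold standard), 0 = not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

point = f1_score(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred)
print(f"f1 = {point:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With small test sets the interval is wide, which is consistent with the broader intervals reported above for oropharynx cancer than for prostate cancer.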
Our results show that treatment identification using free-text clinical notes greatly improves on the performance achieved with billing codes and simple structured data. The approach can be used for treatment cohort identification and adapted for longitudinal cancer treatment identification.