Dana-Farber Cancer Institute, Boston, MA.
Harvard Medical School, Boston, MA.
JCO Clin Cancer Inform. 2024 May;8:e2300269. doi: 10.1200/CCI.23.00269.
Eastern Cooperative Oncology Group (ECOG) performance status (PS) is a key clinical variable for cancer treatment and research, but it is usually only recorded in unstructured form in the electronic health record. We investigated whether natural language processing (NLP) models can impute ECOG PS using unstructured note text.
Medical oncology notes were identified for all patients with cancer at our center from 1997 to 2023 and divided at the patient level into training (approximately 80%), tuning/validation (approximately 10%), and test (approximately 10%) sets. Regular expressions were used to extract explicitly documented PS. Extracted PS labels were used to train NLP models to impute ECOG PS (0-1 v 2-4) from the remainder of each note, with the regular expression-extracted PS documentation removed. We assessed associations between imputed PS and overall survival (OS).
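The labeling and splitting steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the pattern `ECOG_PATTERN`, the helpers `extract_and_mask` and `assign_split`, and the hash-based bucketing are all assumptions made for the example, since the abstract does not give the actual regular expressions or the splitting procedure.

```python
import hashlib
import re

# Hypothetical pattern: the regular expressions actually used are not given
# in the abstract. This one matches phrases such as "ECOG PS: 1".
ECOG_PATTERN = re.compile(
    r"\b(?:ECOG|performance status)\b[^0-9\n]{0,30}?([0-4])\b",
    re.IGNORECASE,
)


def extract_and_mask(note):
    """Return (binary_label, masked_text), or (None, note) if no PS is found.

    The label is 0 for ECOG PS 0-1 and 1 for ECOG PS 2-4. The matched PS
    documentation is removed so a model cannot simply read the answer.
    """
    match = ECOG_PATTERN.search(note)
    if match is None:
        return None, note
    score = int(match.group(1))
    label = 1 if score >= 2 else 0
    masked = ECOG_PATTERN.sub(" ", note)
    return label, masked


def assign_split(patient_id):
    """Assign a patient to a split so all of their notes stay together.

    Assumption: a deterministic hash bucket gives roughly 80/10/10
    proportions; the source only states that the split was at patient level.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10
    if bucket < 8:
        return "train"
    if bucket == 8:
        return "validation"
    return "test"
```

Splitting by patient rather than by note keeps notes from the same patient out of both the training and test sets, which would otherwise inflate the measured performance.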
ECOG PS was extracted using regular expressions from 495,862 notes, corresponding to 79,698 patients. A Transformer-based Longformer model imputed PS with high discrimination (test set area under the receiver operating characteristic curve 0.95, area under the precision-recall curve 0.73). Imputed poor PS was associated with worse OS, including among notes with no explicit documentation of PS detected (OS hazard ratio, 11.9; 95% CI, 11.1 to 12.8).
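For readers unfamiliar with the two discrimination metrics reported above, the following is a small sketch of how they can be computed from binary labels and model scores. The function names `auroc` and `average_precision` are illustrative, the toy data are invented for the example, and average precision is used here as a common estimate of the area under the precision-recall curve; the abstract does not say how the authors computed either metric.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example is
    scored above a randomly chosen negative one (ties count as half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision at the rank of each positive example.

    Simplification: assumes there are no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy data (invented): 1 = poor PS (2-4), 0 = good PS (0-1).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

Unlike the area under the ROC curve, the area under the precision-recall curve depends on how common the positive class is, so the two reported values (0.95 and 0.73) are not directly comparable.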
NLP models can be used to impute performance status from unstructured oncologist notes at scale. This may aid the annotation of oncology data sets for clinical outcomes research and cancer care delivery.