Lee Kyeryoung, Liu Zongzhi, Huang Qing, Corrigan David, Kalsekar Iftekhar, Jun Tomi, Stolovitzky Gustavo, Oh William K, Rajaram Ravi, Wang Xiaoyan
GeneDx (Sema4), Stamford, CT.
Lung Cancer Initiative, Johnson & Johnson, New Brunswick, NJ.
JCO Clin Cancer Inform. 2025 Apr;9:e2400227. doi: 10.1200/CCI-24-00227. Epub 2025 Apr 18.
Recurrences after curative resection in early-stage and locoregionally advanced non-small cell lung cancer (NSCLC) are common, necessitating a nuanced understanding of associated risk factors. This study aimed to establish a natural language processing (NLP) system to efficiently curate recurrence data in NSCLC and analyze risk factors longitudinally.
Electronic health records of 6,351 patients with NSCLC, comprising >700,000 clinical notes, were obtained from Mount Sinai's data sets. A customized deep learning-based NLP system was developed to identify cohorts experiencing recurrence. Recurrence types and rates over time were stratified by various clinical features. Cohort description analysis, Kaplan-Meier analysis for overall recurrence-free survival (RFS) and distant metastasis-free survival (DMFS), and Cox proportional hazards analysis were performed.
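For readers unfamiliar with the Kaplan-Meier analysis named above, the following is a minimal sketch of the product-limit estimator in plain Python. It is not the study's code: the function name `kaplan_meier` and the toy follow-up data are illustrative assumptions, and a real analysis would normally use an established survival library.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient (e.g. months since resection)
    events:    1 if recurrence was observed at that time, 0 if censored
    (Hypothetical helper for illustration; not from the study.)
    """
    # Number of observed recurrences at each time point
    recurrences = Counter(t for t, e in zip(durations, events) if e)
    # Number of patients leaving the risk set (event or censoring) at each time
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = recurrences.get(t, 0)
        if d:
            # Product-limit step: multiply by the conditional probability
            # of remaining recurrence-free through time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Patients censored at t count as at risk at t, then leave
        at_risk -= exits[t]
    return curve


# Toy data (made up): six patients, three observed recurrences
toy_months = [6, 12, 12, 20, 30, 36]
toy_events = [1, 0, 1, 1, 0, 0]
curve = kaplan_meier(toy_months, toy_events)
for month, rfs in curve:
    print(f"month {month}: RFS = {rfs:.3f}")
```

Censored patients contribute to the risk set up to their last follow-up, which is why year-by-year RFS rates can be estimated even when follow-up lengths differ across patients.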
Of 1,295 patients with stage I-IIIA NSCLC who underwent surgical resection, 336 (25.9%) experienced recurrence, as identified through NLP. The NLP system achieved a precision of 94.3%, a recall of 93%, and an F1 score of 93.5%. Among the 336 patients, 52.4% had local/regional recurrences, 44% had distant metastases, and 3.6% had recurrences of unknown type. RFS rates at years 1-5 were 93%, 81%, 73%, 67%, and 61%, respectively (96%, 89%, 84%, 80%, and 75% for distant metastasis). Stage-specific RFS rates at year 5 were 73% (IA), 62% (IB), 47% (IIA), 46% (IIB), and 20% (IIIA). Stage IB patients had a significantly higher likelihood of recurrence versus stage IA (adjusted hazard ratio [aHR], 1.63; P = .02). RFS was lower in patients with clinically significant alteration (KRAS-negative or unknown significance), affecting overall RFS (aHR, 1.89; P = .007) and DMFS (aHR, 2.47; P = .009) among stage IA/IB patients.
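As a reminder of how the reported NLP metrics relate to one another, the sketch below computes F1 as the harmonic mean of precision and recall. The function name `f1_score` is an illustrative assumption. The abstract does not say how its F1 score was aggregated, so the result here is only expected to land close to the reported 93.5, not to reproduce it exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract
f1 = f1_score(0.943, 0.93)
print(f"F1 = {100 * f1:.1f}%")
```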
Our scalable NLP system enabled us to generate real-world insights into NSCLC recurrences, paving the way for predictive models for preventing, diagnosing, and treating NSCLC recurrence.