Goulart Bernardo Haddock Lobo, Silgard Emily T, Baik Christina S, Bansal Aasthaa, Sun Qin, Durbin Eric B, Hands Isaac, Shah Darshil, Arnold Susanne M, Ramsey Scott D, Kavuluru Ramakanth, Schwartz Stephen M
Fred Hutchinson Cancer Research Center, Seattle, WA.
University of Washington, Seattle, WA.
JCO Clin Cancer Inform. 2019 May;3:1-15. doi: 10.1200/CCI.18.00098.
SEER registries do not report results of epidermal growth factor receptor (EGFR) and anaplastic lymphoma kinase (ALK) mutation tests. To facilitate population-based research in molecularly defined subgroups of non-small-cell lung cancer (NSCLC), we assessed the validity of natural language processing (NLP) for the ascertainment of EGFR and ALK testing from electronic pathology (e-path) reports of NSCLC cases included in two SEER registries: the Cancer Surveillance System (CSS) and the Kentucky Cancer Registry (KCR).
We obtained 4,278 e-path reports from 1,634 patients who were diagnosed with stage IV nonsquamous NSCLC from September 1, 2011, to December 31, 2013, and were included in CSS. We used 855 CSS reports to train NLP systems for the ascertainment of EGFR and ALK test status (reported vs not reported) and test results (positive vs negative). We assessed sensitivity, specificity, and positive and negative predictive values in an internal validation sample of 3,423 CSS e-path reports and repeated the analysis in an external sample of 1,041 e-path reports from 565 KCR patients. Two oncologists manually reviewed all e-path reports to generate gold-standard data sets.
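The validity metrics named above can be computed from a comparison of NLP output against the gold-standard labels. The following is a minimal sketch, not the authors' code: the names `Confusion`, `tally`, and `validity_metrics` are hypothetical, the F score is assumed to be the F1 score (the harmonic mean of positive predictive value and sensitivity), and the labels in the usage note are made-up toy values rather than study data.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from comparing NLP labels against gold-standard labels."""
    tp: int  # gold positive, NLP positive
    fp: int  # gold negative, NLP positive
    fn: int  # gold positive, NLP negative
    tn: int  # gold negative, NLP negative


def tally(gold, predicted):
    """Build a confusion matrix from two equal-length sequences of booleans."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if g and p:
            tp += 1
        elif not g and p:
            fp += 1
        elif g and not p:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def validity_metrics(c):
    """Sensitivity, specificity, PPV, NPV, and F1 (assumed F score definition)."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)
    specificity = _ratio(c.tn, c.tn + c.fp)
    ppv = _ratio(c.tp, c.tp + c.fp)
    npv = _ratio(c.tn, c.tn + c.fn)
    # Written in terms of counts, F1 = 2*TP / (2*TP + FP + FN).
    f1 = _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }
```

As a usage illustration, calling `validity_metrics(tally(gold, predicted))` on ten toy reports, where `gold` marks whether a reviewer found a test reported and `predicted` is the NLP label, returns a dictionary with one value per metric. Undefined ratios (for example, PPV when the NLP system flags no reports) come back as NaN instead of raising an error.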
NLP systems yielded internal validity metrics that ranged from 0.95 to 1.00 for EGFR and ALK test status and results in CSS e-path reports. NLP showed high internal accuracy for the ascertainment of EGFR and ALK in CSS patients, with F scores of 0.95 and 0.96, respectively. In the external validation analysis, NLP yielded metrics that ranged from 0.02 to 0.96 in KCR reports and F scores of 0.70 and 0.72, respectively, in KCR patients.
NLP is an internally valid method for the ascertainment of EGFR and ALK test information from e-path reports available in SEER registries, but future work is necessary to increase NLP external validity.