Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Rockville, MD, USA.
Information Management Services, Inc., Calverton, MD, USA.
Int J Med Inform. 2023 Sep;177:105157. doi: 10.1016/j.ijmedinf.2023.105157. Epub 2023 Jul 17.
The National Cancer Institute (NCI) conducts Patterns of Care (POC) studies for selected cancer sites under a Congressional mandate. These studies collect treatment information beyond what is routinely collected by the NCI's Surveillance, Epidemiology, and End Results (SEER) Program. The 2019 POC study focused on two cancer sites: non-small cell lung cancer (NSCLC) and melanoma. For the NSCLC cases, one of the primary sampling objectives was to oversample patients who tested positive for EGFR/ALK mutations, but mutation test results were not available before the study sample was selected.
To address this, text mining algorithms were developed to screen all eligible NSCLC cases from the SEER database. These algorithms were designed to identify the mutation test status, allowing for stratified sampling based on SEER registry, sex, race/ethnicity, and tumor mutation test results.
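The screening and sampling steps described above can be illustrated with a minimal sketch. The abstract does not describe the actual algorithms, so everything here is an assumption for illustration: the keyword patterns, the function names `classify_mutation_status` and `stratified_sample`, the record layout, and the sampling rates are all hypothetical. The sketch also stratifies on mutation status alone, whereas the study additionally stratified on SEER registry, sex, and race/ethnicity.

```python
import random
import re

# Hypothetical keyword rules; the study's actual rules are not given in the abstract.
GENE = re.compile(r"\b(?:EGFR|ALK)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\b(?:negative|not detected|no mutation|wild[- ]?type)\b", re.IGNORECASE)
POSITIVE = re.compile(r"\b(?:positive|detected|rearranged|mutation present)\b", re.IGNORECASE)


def classify_mutation_status(text):
    """Return 'positive', 'negative', or 'unknown' for a free-text note."""
    found_negative = False
    for clause in re.split(r"[.;\n]", text):
        if not GENE.search(clause):
            continue  # clause does not mention EGFR or ALK
        # Check negations first, because "not detected" contains "detected".
        if NEGATIVE.search(clause):
            found_negative = True
        elif POSITIVE.search(clause):
            return "positive"
    return "negative" if found_negative else "unknown"


def stratified_sample(cases, rates, seed=0):
    """Sample each mutation-status stratum at its own (hypothetical) rate."""
    rng = random.Random(seed)
    strata = {}
    for case in cases:
        status = classify_mutation_status(case["text"])
        strata.setdefault(status, []).append(case)
    sample = []
    for status, members in strata.items():
        k = round(rates.get(status, 0.0) * len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

Giving the positive stratum a higher rate than the others, for example `{"positive": 1.0, "negative": 0.5}`, is what oversamples the uncommon group; any analysis of such a sample would then need weights that reflect the unequal selection probabilities.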
The final NSCLC sample included 2,434 patients aged 20+ with advanced stage (IIIB-IVB) NSCLC diagnosed in 2017 and 2018. Among this sample, 692 cases (13.2%) tested positive for EGFR/ALK mutations. An evaluation of the text mining algorithms' performance, based on cases where both algorithm results and known EGFR/ALK status from medical chart abstraction were available, showed good results: sensitivity of 77.6%, specificity of 90.8%, and an overall accuracy of 84.8%.
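For reference, the three reported performance measures follow the standard confusion-matrix definitions, with chart abstraction as the reference standard. The sketch below shows how they are computed; the abstract does not report the underlying counts, so the counts in the usage lines are hypothetical and the function name `screening_metrics` is an assumption.

```python
def screening_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, and accuracy from confusion-matrix counts.

    tp/fn: chart-confirmed positives flagged / missed by the algorithm.
    tn/fp: chart-confirmed negatives cleared / wrongly flagged by the algorithm.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts, not taken from the study.
sens, spec, acc = screening_metrics(tp=8, fn=2, tn=9, fp=1)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
```

Accuracy depends on the mix of positives and negatives in the evaluated cases, which is why it can fall between sensitivity and specificity, as it does in the reported figures.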
The adaptation of text mining algorithms proved effective for oversampling patients with uncommon conditions in studies where electronic medical records are accessible. The 2019 POC study provides valuable data for researchers to evaluate cancer therapy details and patient characteristics, particularly among patients who tested positive for EGFR/ALK mutations.