Applying computer text mining algorithms for oversampling tumor mutation status in medical records for the NCI Patterns of Care studies.

Affiliations

Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Rockville, MD, USA.

Information Management Services, Inc., Calverton, MD, USA.

Publication information

Int J Med Inform. 2023 Sep;177:105157. doi: 10.1016/j.ijmedinf.2023.105157. Epub 2023 Jul 17.

DOI:10.1016/j.ijmedinf.2023.105157

PMID:37480595

Abstract

BACKGROUND

The National Cancer Institute (NCI) conducts Patterns of Care (POC) studies for selected cancer sites under a Congressional mandate. These studies aim to collect treatment information beyond what is typically collected by the NCI's Surveillance, Epidemiology, and End Results (SEER) Program. The 2019 POC study focused on non-small cell lung cancer (NSCLC) and melanoma. For the NSCLC cases, one of the primary sampling objectives was to oversample patients who tested positive for EGFR/ALK mutations, but information on mutation test results was not available before the study sample was selected.

METHODS

To address this, text mining algorithms were developed to screen all eligible NSCLC cases from the SEER database. These algorithms were designed to identify the mutation test status, allowing for stratified sampling based on SEER registry, sex, race/ethnicity, and tumor mutation test results.
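The abstract does not describe the screening rules or the sampling rates, so the following is only a minimal sketch of the general idea, not the study's method. It assumes a simple keyword-based screen over free-text notes and then samples each stratum (registry, sex, race/ethnicity, screened status) at a rate that depends on the screened status. The cue words, field names (`registry`, `sex`, `race`, `text`), and rates are all hypothetical.

```python
import math
import random
import re
from collections import defaultdict

# Hypothetical cue patterns; the study's actual rules are not given in the abstract.
GENE = re.compile(r"\b(?:EGFR|ALK)\b", re.I)
NEG_CUE = re.compile(r"\b(?:negative|not (?:detected|identified)|wild[- ]?type)\b", re.I)
POS_CUE = re.compile(r"\b(?:positive|detected|identified)\b", re.I)


def screen_mutation_status(text):
    """Return 'positive', 'negative', or 'unknown' for a free-text note."""
    found_negative = False
    # Look at each clause separately so "EGFR negative, ALK positive" is handled.
    for clause in re.split(r"[.;,\n]", text):
        if not GENE.search(clause):
            continue
        # Check negative cues first, because "not detected" contains "detected".
        if NEG_CUE.search(clause):
            found_negative = True
        elif POS_CUE.search(clause):
            return "positive"
    return "negative" if found_negative else "unknown"


def stratified_oversample(cases, rates, seed=0):
    """Sample each stratum at the rate assigned to its screened status."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        status = screen_mutation_status(case["text"])
        strata[(case["registry"], case["sex"], case["race"], status)].append(case)
    sample = []
    for key, members in strata.items():
        rate = rates[key[3]]  # key[3] is the screened status
        k = min(len(members), math.ceil(rate * len(members)))
        sample.extend(rng.sample(members, k))
    return sample
```

Giving the screened-positive strata a higher rate than the others is what produces the oversampling; because the screen is imperfect, the true mutation status still has to come from chart abstraction afterwards.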

RESULTS

The final NSCLC sample included 2,434 patients aged 20+ with advanced stage (IIIB-IVB) NSCLC diagnosed in 2017 and 2018. Among this sample, 692 cases (13.2%) tested positive for EGFR/ALK mutations. An evaluation of the text mining algorithms' performance, restricted to cases where both the algorithm result and the known EGFR/ALK status from medical chart abstraction were available, showed good results: a sensitivity of 77.6%, a specificity of 90.8%, and an overall accuracy of 84.8%.
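For reference, the three reported measures follow from a 2x2 comparison of the algorithm result against the chart-abstracted status. The sketch below uses the standard definitions; the counts in the usage note are hypothetical, since the abstract does not report the confusion matrix. The second function uses the identity accuracy = p × sensitivity + (1 − p) × specificity to back out the fraction p of true positives in an evaluation set, which for the rounded figures above comes to roughly 45%.

```python
def screening_metrics(tp, fn, tn, fp):
    """Standard sensitivity, specificity, and accuracy from 2x2 counts."""
    sensitivity = tp / (tp + fn)   # true positives among chart-positive cases
    specificity = tn / (tn + fp)   # true negatives among chart-negative cases
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


def implied_positive_fraction(sensitivity, specificity, accuracy):
    """Solve accuracy = p*sensitivity + (1-p)*specificity for p."""
    return (specificity - accuracy) / (specificity - sensitivity)


# Reported (rounded) percentages from the abstract.
p = implied_positive_fraction(77.6, 90.8, 84.8)
```

With hypothetical counts such as `screening_metrics(tp=8, fn=2, tn=9, fp=1)`, the function returns a sensitivity of 0.8, a specificity of 0.9, and an accuracy of 0.85.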

CONCLUSIONS

The adaptation of text mining algorithms proved effective for oversampling patients with uncommon conditions in studies where electronic medical records are accessible. The 2019 POC study provides valuable data for researchers to evaluate cancer therapy details and patient characteristics, particularly among patients who tested positive for EGFR/ALK mutations.
