California Cancer Reporting and Epidemiologic Surveillance Program, Institute for Population Health Improvement, University of California Davis Health, Sacramento, California, United States of America.
University of California Davis, Graduate Group in Epidemiology, Davis, California, United States of America.
PLoS One. 2019 Feb 22;14(2):e0212454. doi: 10.1371/journal.pone.0212454. eCollection 2019.
Population-based cancer registries hold treatment information for all patients, which makes them an excellent resource for population-level monitoring. However, specific treatment details, such as drug names, are stored in free-text fields that are difficult to process and summarize. We assessed the accuracy and efficiency of a text-mining algorithm to identify systemic treatments for lung cancer from free-text fields in the California Cancer Registry.
The algorithm used Perl regular expressions in SAS 9.4 to search for treatments in 24,845 free-text records associated with 17,310 patients in California diagnosed with stage IV non-small cell lung cancer between 2012 and 2014. Our algorithm categorized treatments into six groups that align with National Comprehensive Cancer Network guidelines. We compared results to a manual review (gold standard) of the same records.
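The regular-expression search described above can be illustrated with a minimal sketch. It is written in Python rather than SAS, and the category names, drug patterns, and sample records are hypothetical placeholders. They are not the six groups or the patterns used in the study.

```python
import re

# Hypothetical category -> pattern map for illustration only.
# These are NOT the study's six NCCN-aligned groups or its actual patterns.
PATTERNS = {
    "platinum": re.compile(r"\b(carboplatin|cisplatin)\b", re.IGNORECASE),
    "pemetrexed": re.compile(r"\b(pemetrexed|alimta)\b", re.IGNORECASE),
    "bevacizumab": re.compile(r"\b(bevacizumab|avastin)\b", re.IGNORECASE),
}


def classify(text):
    """Return the sorted list of categories whose pattern matches the text."""
    return sorted(name for name, pat in PATTERNS.items() if pat.search(text))


# Made-up free-text records, not registry data.
records = [
    "Pt started CARBOPLATIN + Alimta 3/2013",
    "No chemo given, pt refused",
]
for record in records:
    print(classify(record))
```

A plain keyword match like this would also flag a record such as "pt refused carboplatin", so a production algorithm would presumably need extra rules for negation and misspellings. The abstract does not say how the study handled those cases.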
Percent agreement ranged from 91.1% to 99.4%. Ranges for other measures were 0.71-0.92 (Kappa), 74.3%-97.3% (sensitivity), 92.4%-99.8% (specificity), 60.4%-96.4% (positive predictive value), and 92.9%-99.9% (negative predictive value). The text-mining algorithm used one-sixth of the time required for manual review.
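For reference, the agreement measures reported above can all be derived from a 2x2 table comparing the algorithm with manual review for one treatment group. The sketch below uses the standard definitions. The function name and the counts in the example are invented for illustration and are not taken from the study.

```python
def agreement_metrics(tp, fp, fn, tn):
    """Compute agreement measures from a 2x2 table.

    tp: algorithm positive, manual review positive
    fp: algorithm positive, manual review negative
    fn: algorithm negative, manual review positive
    tn: algorithm negative, manual review negative
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals (Cohen's kappa).
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "percent_agreement": observed,
        "kappa": (observed - expected) / (1 - expected),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Invented counts for illustration only.
metrics = agreement_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Kappa corrects percent agreement for chance, which is why a group can show high percent agreement alongside a noticeably lower kappa when the treatment is rare.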
SAS-based text mining of free-text data can accurately detect systemic treatments administered to patients and save considerable time compared to manual review, maximizing the utility of the extant information in population-based cancer registries for comparative effectiveness research.