Yang Shuang, Huang Yu, Lou Xiwei, Lyu Tianchen, Wei Ruoqi, Mehta Hiren J, Wu Yonghui, Alvarado Michelle, Salloum Ramzi G, Braithwaite Dejana, Huo Jinhai, Shih Ya-Chen Tina, Guo Yi, Bian Jiang
Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL.
Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL.
JCO Clin Cancer Inform. 2025 Jan;9:e2400139. doi: 10.1200/CCI.24.00139. Epub 2025 Jan 16.
Lung cancer screening (LCS) has the potential to reduce mortality and detect lung cancer at its early stages, but the high false-positive rate associated with low-dose computed tomography (LDCT) for LCS acts as a barrier to its widespread adoption. This study aims to develop computable phenotype (CP) algorithms on the basis of electronic health records (EHRs) to identify individuals' eligibility for LCS, thereby enhancing LCS utilization in real-world settings.
The study cohort included 5,778 individuals who underwent LDCT for LCS from 2012 to 2022, as recorded in the University of Florida Health Integrated Data Repository. CP rules derived from LCS guidelines were used to identify potential candidates, incorporating both structured EHR data and clinical notes analyzed via natural language processing. We then manually reviewed 453 randomly selected charts to refine and validate these rules, assessing CP performance using metrics such as F1 score, specificity, and sensitivity.
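To make the validation metrics concrete, the following is a minimal sketch of how F1 score, sensitivity, and specificity can be computed when CP output is compared against chart-review labels. The function name `cp_metrics` and the toy labels are hypothetical illustrations, not the study's code or data.

```python
from typing import Dict, Sequence


def cp_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compare CP predictions (1 = eligible) against chart-review labels.

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have equal length")

    # Tally the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against empty denominators by returning 0.0.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Toy example with made-up labels (NOT study data).
chart_review = [1, 1, 1, 0, 0, 1, 0, 1]
cp_output = [1, 1, 0, 0, 1, 1, 0, 1]
print(cp_metrics(chart_review, cp_output))
```

F1 is the harmonic mean of precision and sensitivity, so it penalizes a rule that misses eligible individuals as well as one that flags ineligible ones.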
We developed an optimal CP rule that integrates both structured and unstructured data, adhering to the US Preventive Services Task Force 2013 and 2020 guidelines. This rule focuses on age (55-80 years for 2013 and 50-80 years for 2020), smoking status (current, former, and others), and pack-years (≥30 for 2013 and ≥20 for 2020), achieving F1 scores of 0.75 and 0.84 for the respective guidelines. Including unstructured data improved the F1 score by up to 9.2% for the 2013 guideline and 12.9% for the 2020 guideline, compared with using structured data alone.
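The eligibility criteria described above can be sketched as a simple rule check. This is an illustrative sketch, not the authors' CP algorithm: the `Patient` fields and the `is_lcs_eligible` function are hypothetical names, and the age and pack-year thresholds are those stated in the abstract. The 15-year quit window for former smokers comes from the USPSTF recommendations themselves and is not mentioned in the abstract; treating exactly 15 years as eligible is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Age and pack-year thresholds as stated in the abstract.
CRITERIA = {
    "2013": {"min_age": 55, "max_age": 80, "min_pack_years": 30},
    "2020": {"min_age": 50, "max_age": 80, "min_pack_years": 20},
}

# From the USPSTF recommendations (not stated in the abstract):
# former smokers qualify only if they quit within the past 15 years.
QUIT_WINDOW_YEARS = 15


@dataclass
class Patient:
    """Hypothetical record; in practice fields would come from EHR data."""
    age: int
    smoking_status: str  # "current", "former", or "other"
    pack_years: Optional[float] = None  # None if undocumented
    years_since_quit: Optional[float] = None  # only relevant if "former"


def is_lcs_eligible(p: Patient, guideline: str) -> bool:
    """Return True if the patient meets the given guideline year's criteria."""
    c = CRITERIA[guideline]
    if not (c["min_age"] <= p.age <= c["max_age"]):
        return False
    # Missing pack-year documentation cannot confirm eligibility.
    if p.pack_years is None or p.pack_years < c["min_pack_years"]:
        return False
    if p.smoking_status == "current":
        return True
    if p.smoking_status == "former":
        # Boundary handling (<=) is an assumption of this sketch.
        return (
            p.years_since_quit is not None
            and p.years_since_quit <= QUIT_WINDOW_YEARS
        )
    return False


# A 52-year-old current smoker with 25 pack-years qualifies only under 2020.
patient = Patient(age=52, smoking_status="current", pack_years=25)
print(is_lcs_eligible(patient, "2013"), is_lcs_eligible(patient, "2020"))
```

Because the rule returns `False` whenever pack-years are undocumented, it also illustrates why missing smoking information in structured fields limits recall, and why values extracted from clinical notes can help.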
Our findings underscore the critical need for improved documentation of smoking information in EHRs, demonstrate the value of artificial intelligence techniques in enhancing CP performance, and confirm the effectiveness of EHR-based CP in identifying LCS-eligible individuals. These results support the potential of EHR-based CP to aid clinical decision making and optimize patient care.