Adelson School of Medicine, Ariel University, Ariel, Israel.
Department of Mathematics, Ariel University, Ariel, Israel; Department of Cancer Biology, Cancer Institute, University College London, London, UK.
Cancer Epidemiol. 2024 Oct;92:102631. doi: 10.1016/j.canep.2024.102631. Epub 2024 Jul 24.
Lung cancer (LC) screening using low-dose computed tomography (CT) is recommended according to standard risk criteria or personalized risk calculators. Machine learning (ML) models that predict disease risk are an emerging method in medicine for identifying hidden, patient-specific associations.
Using the tree-based pipeline optimization tool (TPOT), we developed an ML-based model, an ensemble of Random Forest and XGBoost models, based on known risk factors for LC, as part of a larger trial of ML prediction using electronic medical records and chest CT. We used data from patients with LC and controls (1:2), all aged ≥ 35 years. We developed models for the entire study population and, separately, for patients with and without a smoking background. Predictors were age, gender, body mass index (BMI), smoking history, socioeconomic status (SES), history of chronic obstructive pulmonary disease (COPD)/emphysema/chronic bronchitis (CB), interstitial lung disease (ILD)/pulmonary fibrosis (PF), and family history of LC.
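The abstract does not state how the Random Forest and XGBoost outputs are combined. As a minimal sketch of one common option, soft voting, the snippet below averages the predicted LC probabilities of several models and thresholds the result. The names `soft_vote`, `classify`, `forest_like`, and `boosted_like`, the 0.5 threshold, and the stub probabilities are all hypothetical; the stubs stand in for fitted models and are not the study's models.

```python
from typing import Callable, Optional, Sequence

# A "model" here is any function mapping a feature vector to P(LC).
ProbModel = Callable[[Sequence[float]], float]


def soft_vote(
    models: Sequence[ProbModel],
    x: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average of the models' predicted probabilities for x."""
    if not models:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = [1.0] * len(models)
    if len(weights) != len(models):
        raise ValueError("weights and models must have the same length")
    total = sum(weights)
    return sum(w * m(x) for w, m in zip(weights, models)) / total


def classify(models: Sequence[ProbModel], x: Sequence[float],
             threshold: float = 0.5) -> int:
    """Return 1 (predicted LC) if the averaged probability reaches threshold."""
    return int(soft_vote(models, x) >= threshold)


# Hypothetical stand-ins for fitted models (NOT the study's models).
# The feature vector is assumed to be [age, BMI] purely for illustration.
def forest_like(x: Sequence[float]) -> float:
    return 0.7 if x[0] >= 60 else 0.2


def boosted_like(x: Sequence[float]) -> float:
    return 0.5 if x[0] >= 60 else 0.3


if __name__ == "__main__":
    ensemble = [forest_like, boosted_like]
    print(soft_vote(ensemble, [65, 24.0]), classify(ensemble, [65, 24.0]))
    print(soft_vote(ensemble, [40, 24.0]), classify(ensemble, [40, 24.0]))
```

Soft voting is shown only because it is simple to read; a stacked ensemble, where one model's output feeds another, would be an equally plausible reading of "ensemble" here.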
Of the 4076 patients, 1428 (35%) were in the LC group and 2648 (65%) were in the control group. For the entire study population, our model achieved an accuracy of 71.2%, with a sensitivity of 69% and a positive predictive value (PPV) of 74%. Accuracy was higher in both subgroups: 74.8% (sensitivity 72%, PPV 76%) in the smoking cohort and 73.0% (sensitivity 76%, PPV 72%) in the never-smoking cohort. For the entire population and the smoking cohort, COPD/emphysema/CB was the most important contributor, followed by BMI and age, while in the never-smoking cohort, BMI, age, and SES were the most important contributors.
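For readers less familiar with the three reported metrics, the following sketch shows how accuracy, sensitivity, and PPV are computed from a confusion matrix. The counts in `example` are hypothetical and chosen only to make the arithmetic easy to follow; they are not the study's data, and the abstract does not report the underlying confusion matrices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # LC patients predicted as LC
    fp: int  # controls predicted as LC
    tn: int  # controls predicted as controls
    fn: int  # LC patients predicted as controls


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all predictions that are correct."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of true LC cases that the model flags."""
    return c.tp / (c.tp + c.fn)


def ppv(c: ConfusionCounts) -> float:
    """Fraction of flagged cases that truly have LC."""
    return c.tp / (c.tp + c.fp)


# Hypothetical counts for illustration only (NOT from the study).
example = ConfusionCounts(tp=70, fp=25, tn=75, fn=30)

if __name__ == "__main__":
    print(f"accuracy    = {accuracy(example):.3f}")
    print(f"sensitivity = {sensitivity(example):.3f}")
    print(f"PPV         = {ppv(example):.3f}")
```

Unlike sensitivity, PPV depends on the proportion of LC cases in the evaluated sample, so a PPV measured in a case-control sample may not carry over to a screening population where LC is much rarer.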
Known risk factors for LC can be used in ML models to predict LC with modest accuracy. Further studies are needed to validate these results in new patients and to improve model performance.