Song Qilong, Li Xiaohu, Song Biao, Zhang Tingting, Hu Xiankuo, Li Ao, Ma Dongchun, Min Xuhong, Yu Yongqiang
Department of Radiology, the First Affiliated Hospital of Anhui Medical University, Hefei, China.
Clinical Chest College of Anhui Medical University, Hefei, China.
Transl Lung Cancer Res. 2025 Jul 31;14(7):2670-2687. doi: 10.21037/tlcr-2025-237. Epub 2025 Jul 28.
Non-invasive determination of epidermal growth factor receptor (EGFR) mutation status is essential for selecting lung adenocarcinoma patients suitable for EGFR-tyrosine kinase inhibitors (EGFR-TKIs). This study aimed to develop and validate an online ensemble machine learning (EML) model that combines multiple machine learning (ML) models to predict the EGFR mutation status in lung adenocarcinoma.
A total of 823 lung adenocarcinoma patients with known EGFR mutation status from three medical centers were divided into a training cohort (n=556) and a validation cohort (n=267) (ChiCTR2400083082 in the WHO International Clinical Trials Registry). Five ML models incorporating clinical and radiological characteristics were constructed to predict EGFR mutation status: random forest (RF), logistic regression (LR), support vector machine (SVM), light gradient boosting machine (LightGBM), and extreme gradient boosting (XGBoost). A CT-based deep learning (DL) model was constructed for the same task. An EML model was then created by combining these six models. Model performance was assessed using the area under the receiver operating characteristic curve (AUC), and the SHapley Additive exPlanation (SHAP) method was used to explain the EML model.
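The abstract does not state how the individual models were combined, so the following is only a minimal sketch under one common assumption: soft voting, that is, averaging each patient's predicted probabilities across models. The function names `auc` and `soft_vote` and all numbers are hypothetical and invented for illustration; they are not the study's code or data. The AUC is computed from its rank-based definition: the fraction of mutant/wild-type pairs in which the mutant case receives the higher score.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def soft_vote(per_model_probs):
    """Assumed combination rule: average the models' probabilities per patient."""
    return [sum(p) / len(p) for p in zip(*per_model_probs)]


# Toy data, invented for illustration (1 = EGFR mutant, 0 = wild type).
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.4, 0.7, 0.5, 0.2, 0.3]  # hypothetical single model
model_b = [0.6, 0.8, 0.3, 0.2, 0.4, 0.1]  # hypothetical single model

ensemble = soft_vote([model_a, model_b])
auc_a, auc_b, auc_ensemble = (auc(labels, s) for s in (model_a, model_b, ensemble))
```

On this toy data each single model misranks one mutant/wild-type pair, while the averaged scores rank every pair correctly. That is the general motivation for an ensemble; it says nothing about the size of the gain on real data.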
In the training cohort, the AUCs for the RF, LR, SVM, LightGBM, XGBoost, DL, and EML models were 0.851, 0.790, 0.810, 0.835, 0.853, 0.884, and 0.928, respectively. In the validation cohort, the corresponding AUCs were 0.753, 0.744, 0.732, 0.749, 0.751, 0.754, and 0.813. The DeLong test indicated that the EML model outperformed each single model in AUC in both the training and validation cohorts. Decision curve analysis indicated that the EML model provided a clinically useful net benefit, and calibration curves showed good agreement between predicted and observed outcomes. SHAP analysis ranked the predictive characteristics by their contribution to the EML model: DL score, long-axis diameter, smoking history, pleural retraction, texture, vascular convergence, sex, air bronchogram, and bubble-like lucency. These characteristics were further used to develop an online web tool.
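For readers unfamiliar with decision curve analysis, the net benefit it plots can be sketched as follows. This uses the standard definition, true positives per patient minus false positives per patient weighted by the odds of the threshold probability. The function names and the toy numbers are hypothetical and are not taken from the study.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of acting on every patient with predicted probability >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    # False positives are penalized by the odds implied by the threshold.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient as mutation-positive."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data, invented for illustration (1 = EGFR mutant, 0 = wild type).
labels = [1, 1, 1, 0, 0, 0]
probs = [0.75, 0.6, 0.5, 0.35, 0.3, 0.2]

nb_model = net_benefit(labels, probs, threshold=0.4)
nb_all = net_benefit_treat_all(labels, threshold=0.4)
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all reference and zero, the net benefit of treating no one. A decision curve repeats this comparison across a range of thresholds.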
The EML model could serve as a non-invasive and accurate method for predicting EGFR mutation status in lung adenocarcinoma.