Department of Medical Biosciences/Pathology, Umeå University, Umeå, Västerbotten, Sweden.
Research Centre for Applied Molecular Oncology (RECAMO), Masaryk Memorial Cancer Institute, Brno, Czech Republic.
J Oral Pathol Med. 2023 Aug;52(7):637-643. doi: 10.1111/jop.13461. Epub 2023 Jul 10.
Interpretable machine learning (ML) for early detection of cancer has the potential to improve risk assessment and early intervention.
We analyzed data on 261 proteins related to inflammation and/or tumor processes in 123 blood samples collected from healthy individuals, a subgroup of whom later developed squamous cell carcinoma of the oral tongue (SCCOT). Samples from people who developed SCCOT in less than 5 years were classified as tumor-to-be and all other samples as tumor-free. The optimal ML algorithm for feature selection was identified, and feature importance was computed by the SHapley Additive exPlanations (SHAP) method. Five popular ML algorithms (AdaBoost, Artificial Neural Networks [ANNs], Decision Tree [DT], eXtreme Gradient Boosting [XGBoost], and Support Vector Machine [SVM]) were applied to establish prediction models, and the decisions of the optimal models were interpreted by SHAP.
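To illustrate the idea behind SHAP-based feature importance, the following is a minimal sketch of exact Shapley values for a single sample, not the authors' pipeline. The function `shapley_values`, the model `toy_model`, its coefficients, and the input values are all hypothetical. "Absent" features are replaced by a single baseline value, which is a simplification; SHAP implementations typically average over a background dataset and use approximations when there are many features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    A feature that is 'absent' from a coalition takes its baseline value.
    Cost grows as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(subset):
        # Model output when only the features in `subset` take their real values.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical risk score over three protein levels, with one interaction term.
    return 0.5 * z[0] - 0.2 * z[1] + 0.1 * z[0] * z[2]


sample = [2.0, 1.0, 3.0]      # hypothetical protein levels for one person
reference = [0.0, 0.0, 0.0]   # hypothetical baseline levels
phi = shapley_values(toy_model, sample, reference)
print([round(p, 3) for p in phi])
```

The contributions sum to the difference between the model output for the sample and for the baseline, which is what lets per-person explanations be read as an additive breakdown of a single prediction.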
Using the 22 selected features, the SVM prediction model showed the best performance (sensitivity = 0.867, specificity = 0.859, balanced accuracy = 0.863, area under the receiver operating characteristic curve [ROC-AUC] = 0.924). SHAP analysis revealed that the 22 features had varying, person-specific impacts on the model's decisions, and that the top three contributors to prediction were Interleukin 10 (IL10), TNF Receptor Associated Factor 2 (TRAF2), and Kallikrein Related Peptidase 12 (KLK12).
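The reported metrics are related in a simple way: balanced accuracy is the mean of sensitivity and specificity. The sketch below shows the standard definitions. The confusion-matrix counts are hypothetical and are not taken from the study; only the sensitivity and specificity in the final check come from the text above.

```python
def sensitivity(tp, fn):
    # Fraction of tumor-to-be samples correctly identified.
    return tp / (tp + fn)


def specificity(tn, fp):
    # Fraction of tumor-free samples correctly identified.
    return tn / (tn + fp)


def balanced_accuracy(sens, spec):
    # Mean of sensitivity and specificity; robust to class imbalance.
    return (sens + spec) / 2


# Hypothetical counts, for illustration only.
sens = sensitivity(tp=8, fn=2)
spec = specificity(tn=18, fp=2)
print(sens, spec, balanced_accuracy(sens, spec))

# Consistency check on the reported values.
reported = round(balanced_accuracy(0.867, 0.859), 3)
print(reported)
```

With the reported sensitivity of 0.867 and specificity of 0.859, the mean is 0.863, which matches the stated balanced accuracy.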
Using multidimensional plasma protein analysis and interpretable ML, we outline a systematic approach for early detection of SCCOT before the appearance of clinical signs.