Department of Medical Biosciences/Pathology, Umeå University, Umeå, Sweden.
Research Centre for Applied Molecular Oncology, Masaryk Memorial Cancer Institute, Brno, 656 53, Czech Republic.
Comput Biol Med. 2022 Oct;149:105991. doi: 10.1016/j.compbiomed.2022.105991. Epub 2022 Aug 18.
Patients with squamous cell carcinoma of the head and neck (SCCHN) have a high risk of recurrence. We aimed to develop machine learning methods to identify transcriptomic and proteomic features that provide accurate classification models for predicting risk of early recurrence in SCCHN patients.
Clinical, genomic, transcriptomic and proteomic features distinguishing recurrence risk were examined in SCCHN patients from The Cancer Genome Atlas (TCGA). Recurrence within one year after treatment was classified as high-risk and no recurrence as low-risk.
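As an illustration of the grouping rule stated above, a minimal Python sketch follows. The function name, the 365-day cutoff, and the handling of recurrences later than one year (left unlabeled here) are assumptions for illustration; the abstract does not specify them.

```python
from typing import Optional

# Assumed cutoff in days; the abstract says only "within one year after treatment".
ONE_YEAR_DAYS = 365


def recurrence_risk_label(days_to_recurrence: Optional[float]) -> Optional[str]:
    """Map time to recurrence onto the two risk groups (hypothetical helper).

    None means no recurrence was recorded for the patient.
    """
    if days_to_recurrence is None:
        return "low-risk"   # no recurrence
    if days_to_recurrence <= ONE_YEAR_DAYS:
        return "high-risk"  # recurrence within one year after treatment
    # Recurrence after one year is not defined in the abstract; left unlabeled here.
    return None


labels = [recurrence_risk_label(d) for d in (120, None, 800)]
print(labels)
```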
Conventional statistical analysis showed no significant differences between the groups in individual clinicopathological characteristics, mutation profiles or mRNA expression patterns. With the machine learning algorithm extreme gradient boosting (XGBoost), ten proteins (RAD50, 4E-BP1, MYH11, MAP2K1, BECN1, NF2, RAB25, ERRFI1, KDR, SERPINE1) and five mRNAs (PLAUR, DKK1, AXIN2, ANG and VEGFA) made the greatest contribution to classification. These features were used to build improved XGBoost models; the best discrimination was achieved by combining transcriptomic and proteomic data, with an accuracy of 0.939 and an area under the ROC curve (AUC) of 0.951.
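For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute accuracy and AUC from a classifier's predicted scores. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all illustrative assumptions, and the AUC is computed with the standard pairwise-ranking definition (the probability that a high-risk case scores above a low-risk case).

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of cases whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): 1 = high-risk, 0 = low-risk; scores are predicted probabilities.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]

print(f"accuracy = {accuracy(y_true, y_score):.3f}")
print(f"AUC      = {roc_auc(y_true, y_score):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.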
This study highlights the use of machine learning to identify transcriptomic and proteomic factors that are important for predicting risk of recurrence in patients with SCCHN, and to refine such models through iterative cycles to enhance their accuracy, thereby aiding the introduction of personalized treatment regimens.