Felici Alessio, Peduzzi Giulia, Pellungrini Roberto, Campa Daniele, Canzian Federico
Department of Biology, University of Pisa, Pisa 56126, Italy.
Genomic Epidemiology Group, German Cancer Research Center (DKFZ), Heidelberg 69120, Germany.
Brain Commun. 2025 May 27;7(3):fcaf187. doi: 10.1093/braincomms/fcaf187. eCollection 2025.
Glioblastoma multiforme is a lethal disease, with a 5-year survival rate of <10%. The identification of risk factors for glioblastoma multiforme is essential for the understanding of this disease and could facilitate more effective stratification of high-risk individuals. However, our current knowledge of glioblastoma multiforme risk factors is limited. Given the complexity and heterogeneity of the disease, traditional epidemiological approaches may be insufficient to study risk factors for glioblastoma multiforme. The combination of traditional approaches with machine learning models could prove effective in identifying relevant factors for glioblastoma multiforme risk. In this study, we developed glioblastoma multiforme risk models in the UK Biobank cohort using 576 glioblastoma multiforme cases and 302 602 controls. First, 369 exposures were tested with traditional regression models in a case-control study and significant associations were identified. Subsequently, significant features were filtered based on their completion rate and correlation. The selected exposures were then used to develop two machine learning models: a support vector machine and a multi-layer perceptron. To address the imbalance within the subpopulation, two controls per case with full data were selected, resulting in 442 glioblastoma multiforme cases and 884 controls being analysed with the machine learning models. Relevant factors for glioblastoma multiforme risk were identified by explaining the results of the two models with Shapley Additive Explanations. Traditional regression methods identified 38 significant associations between environmental exposures and glioblastoma multiforme risk under the Bonferroni threshold (P < 1.35 × 10⁻⁴). Subsequent filtering resulted in the selection of 12 exposures, which were then analysed with age, sex and a polygenic score using the two machine learning models. 
The support vector machine and the multi-layer perceptron showed good sensitivity (0.91 and 0.82, respectively). In addition to age and genetics, Shapley Additive Explanations showed significant contributions of insulin-like growth factor 1 blood levels and right-hand grip strength to the predictions made by the models, with the latter effect potentially being confounded by endogenous testosterone levels. The integration of machine learning with traditional models has the potential to enhance the identification of risk factors for glioblastoma multiforme.
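
The arithmetic behind three figures in the abstract can be illustrated with a minimal Python sketch; it is not the authors' pipeline. It assumes a family-wise alpha of 0.05 (which is consistent with the reported threshold for 369 tests), and it assumes simple random sampling for the two-controls-per-case selection, since the abstract does not say how controls were chosen. The function names, the participant IDs and the pool of 5000 eligible controls are hypothetical placeholders.

```python
import random


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def sample_controls(case_ids, control_ids, ratio=2, seed=0):
    """Draw `ratio` controls per case without replacement.

    Assumption: simple random sampling; the abstract does not state
    the selection method actually used.
    """
    n_needed = ratio * len(case_ids)
    if n_needed > len(control_ids):
        raise ValueError("not enough controls for the requested ratio")
    rng = random.Random(seed)
    return rng.sample(list(control_ids), n_needed)


def sensitivity(y_true, y_pred):
    """True-positive rate: TP / (TP + FN), with 1 = case, 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no positive samples in y_true")
    return tp / (tp + fn)


# 369 exposures tested; alpha = 0.05 is an assumption.
threshold = bonferroni_threshold(0.05, 369)

# Placeholder IDs: 442 cases with full data, hypothetical pool of 5000 controls.
cases = list(range(442))
control_pool = list(range(1000, 6000))
selected_controls = sample_controls(cases, control_pool, ratio=2)

print(f"Bonferroni threshold: {threshold:.3g}")
print(f"Cases: {len(cases)}, selected controls: {len(selected_controls)}")
```

With 442 cases, a 2:1 ratio yields the 884 controls reported above. Sensitivity as defined here is the fraction of true cases that a model labels as cases, which is the metric reported for the two classifiers.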