Arnal Segura Magdalena, Bini Giorgio, Krithara Anastasia, Paliouras Georgios, Tartaglia Gian Gaetano
Centre for Human Technologies, Istituto Italiano di Tecnologia, Via Enrico Melen, 83, 16152 Genova, Italy.
Department of Biology 'Charles Darwin', Sapienza University of Rome, P.le A. Moro 5, 00185 Rome, Italy.
Int J Mol Sci. 2025 Feb 27;26(5):2085. doi: 10.3390/ijms26052085.
Complex diseases pose challenges in prediction due to their multifactorial and polygenic nature. This study employed machine learning (ML) to analyze genomic data from the UK Biobank, aiming to predict the genomic predisposition to complex diseases like multiple sclerosis (MS) and Alzheimer's disease (AD). We tested logistic regression (LR), ensemble tree methods, and deep learning models for this purpose. LR displayed remarkable stability across various subsets of data, outshining deep learning approaches, which showed greater variability in performance. Additionally, ML methods demonstrated an ability to maintain optimal performance despite correlated genomic features due to linkage disequilibrium. When comparing the performance of polygenic risk score (PRS) with ML methods, PRS consistently performed at an average level. By employing explainability tools in the ML models of MS, we found that the results confirmed the polygenicity of this disease. The highest-prioritized genomic variants in MS were identified as expression or splicing quantitative trait loci located in non-coding regions within or near genes associated with the immune response, with a prevalence of human leukocyte antigen (HLA) gene annotations. Our findings shed light on both the potential and the challenges of employing ML to capture complex genomic patterns, paving the way for improved predictive models.
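As an illustration of the PRS baseline mentioned in the abstract, the following is a minimal sketch that assumes the conventional definition of a polygenic risk score as a weighted sum of risk-allele dosages, evaluated with the area under the ROC curve. The abstract does not state how the study built its PRS or which metric it reported, so the function names, the weights, and the toy genotypes below are all hypothetical and are not taken from the study.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("one weight per variant is required")
    return sum(d * w for d, w in zip(dosages, weights))


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical per-variant effect weights (e.g. log odds ratios); illustrative only.
weights = [0.8, 0.3, -0.2]

# Hypothetical genotype matrix: one row per individual, one dosage per variant.
genotypes = [
    [2, 1, 0],
    [1, 2, 0],
    [1, 1, 1],
    [0, 1, 2],
    [0, 0, 1],
    [1, 0, 2],
]

# Hypothetical case (1) / control (0) labels.
labels = [1, 1, 0, 0, 0, 1]

scores = [polygenic_risk_score(g, weights) for g in genotypes]
auc = roc_auc(scores, labels)
print(f"PRS AUC on toy data: {auc:.3f}")
```

On this toy data, one case scores below one control, so the AUC is 8/9. An ML model such as logistic regression would be scored with the same `roc_auc` function on its predicted probabilities, which gives one possible way to put PRS and ML methods on a common scale.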