Danelakis Antonios, Kumelj Tjaša, Winsvold Bendik S, Helene Bjørk Marte, Nachev Parashkev, Matharu Manjit, Giles Dominic, Tronvik Erling, Langseth Helge, Stubberud Anker
NorHead Norwegian Centre for Headache Research, NTNU Norwegian University of Science and Technology, 7030, Trondheim, Norway.
Department of Computer Science, NTNU Norwegian University of Science and Technology, 7034, Trondheim, Norway.
Brain. 2025 May 6. doi: 10.1093/brain/awaf172.
Migraine has an assumed polygenic basis, but the genetic risk variants identified in genome-wide association studies explain only a proportion of the heritability. We aimed to develop machine learning models that capture non-additive and interactive effects, to address the missing heritability. This was a cross-sectional population-based study of participants in the second and third Trøndelag Health Study. Individuals underwent genome-wide genotyping and were phenotyped based on validated modified criteria of the International Classification of Headache Disorders. Four datasets with increasing numbers of genetic variants were created using different thresholds of linkage disequilibrium and univariate genome-wide association p-values. A series of machine learning and deep learning methods were optimized and evaluated. The genotype tools PLINK and LDPred2 were used for polygenic risk scoring. Models were trained on a partition of the dataset and tested on a hold-out set. The area under the receiver operating characteristic curve was used as the primary scoring metric. Classification by machine learning was statistically compared to that of polygenic risk scoring. Finally, we explored the biological functions of the variants unique to the machine learning approach. A total of 43,197 individuals (51% women), with a mean age of 54.6 years, were included in the modelling. A light gradient boosting machine performed best for the three smallest datasets (108, 7,771 and 7,840 variants), each with a hold-out test set area under the curve of 0.63. A multinomial naïve Bayes model performed best in the largest dataset (140,467 variants), with a hold-out test set area under the curve of 0.62. The models were statistically significantly superior to polygenic risk scoring (area under the curve 0.52 to 0.59) for all the datasets (p<0.001 to p=0.02).
Machine learning identified many of the same genes and pathways found in genome-wide association studies, but also several unique pathways, mainly related to signal transduction and neurological function. Interestingly, pathways related to botulinum toxins and to the calcitonin gene-related peptide receptor also emerged. This study suggests that migraine may follow a non-additive and interactive genetic causal structure, potentially best captured by complex machine learning models. Such structure may be concealed where the data dimensionality (a high number of genetic variants) is insufficiently supported by the scale of available data, leaving a misleading impression of purely additive effects. Future machine learning models using substantially larger sample sizes could harness both the additive and the interactive effects, enhancing precision and offering a deeper understanding of the genetic interactions underlying migraine.
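To make the contrast between additive scoring and interaction-aware modelling concrete, the following is a minimal sketch on synthetic toy data, not the study's data or pipeline. It computes the area under the receiver operating characteristic curve from its rank-based definition and compares an additive, polygenic-style score with a score that encodes a two-variant interaction. The function names `auc` and `additive_score`, the unit weights, and the XOR-style phenotype are all assumptions chosen for illustration; the study itself used PLINK and LDPred2 for polygenic risk scoring and models such as a light gradient boosting machine.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen case scores higher than a randomly chosen control
    (ties count as one half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


def additive_score(genotype, weights):
    """Additive score: weighted sum of per-variant codes (hypothetical weights)."""
    return sum(w * g for w, g in zip(weights, genotype))


# Synthetic toy cohort (assumption): two variants coded 0/1 for carrier status.
# The phenotype is purely interactive: case if exactly one variant is carried.
genotypes = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
labels = [a ^ b for a, b in genotypes]

# Additive score with unit weights for both variants.
additive = [additive_score(g, (1.0, 1.0)) for g in genotypes]

# Score with an explicit interaction term; algebraically equal to XOR here.
interactive = [a + b - 2 * a * b for a, b in genotypes]

auc_additive = auc(additive, labels)
auc_interactive = auc(interactive, labels)

print(f"additive score AUC:    {auc_additive:.2f}")
print(f"interaction score AUC: {auc_interactive:.2f}")
```

In this deliberately extreme toy case the additive score cannot separate cases from controls, whereas the score with the interaction term separates them perfectly. Real genotype data would show far smaller differences, in line with the modest gap between the areas under the curve reported above.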