Platt Daniel E, Guzmán-Sáenz Aldo, Bose Aritra, Saha Subrata, Utro Filippo, Parida Laxmi
IBM T. J. Watson Research Center, Yorktown Heights, New York, NY, USA.
Pfizer, Pearl River, New York, NY, USA.
iScience. 2024 Feb 12;27(3):109209. doi: 10.1016/j.isci.2024.109209. eCollection 2024 Mar 15.
GWAS focuses on significance to reduce false positives; machine learning probes sub-significant features, relying on predictivity. Yet these are far from orthogonal. We sought to explore how the two inform each other in sub-genome-wide-significant settings to define relevance for predictive features. We introduce the SVM-based RubricOE, which selects heavily cross-validated feature sets, and use LDpred2 PRS as a strong contrast to SVM, to explore significance and predictivity. Our Alzheimer's disease test case notoriously lacks strong genetic signals except for a few very strong phenotype-SNP associations, which suits the problem we are exploring. We found that the most significant SNPs among the ML- and PRS-selected SNPs captured most of the predictivity, while weaker associations also tend to contribute only weakly to predictivity. SNPs with weak associations tend not to contribute to predictivity, and deleting these features does not harm it. Significance provides a ranking that helps identify weakly predictive features.