Liu Xiran, Crawford Lorin, Ramachandran Sohini
Brown University, Providence, RI 02906, USA.
Microsoft Research, Cambridge, MA 02142, USA.
bioRxiv. 2025 Feb 12:2025.02.11.637655. doi: 10.1101/2025.02.11.637655.
A fundamental goal of genetics is to identify which genetic variants are associated with a trait, and how, often using the regression results from genome-wide association (GWA) studies. Two important methodological challenges are accounting for inflation in GWA effect estimates and investigating more than one trait simultaneously. We leverage machine learning approaches to address both challenges, developing a computationally efficient method. First, we use neural networks to shrink the inflation in GWA effect sizes caused by non-independence among variants. We then cluster variant associations across multiple traits via variational inference. We compare the performance of neural network shrinkage against regularized regression and fine-mapping, two approaches that also address inflated effects but operate on variants in focal regions of different sizes. Our neural network shrinkage outperforms both methods in approximating the true effect sizes in simulated data. Our infinite mixture clustering approach offers a flexible, data-driven way to distinguish different types of associations among multiple traits (trait-specific, shared across traits, or spurious) based on their regularized effects. Clustering applied to our neural network shrinkage results also produces consistently higher precision and recall for distinguishing gene-level associations in simulations. We demonstrate the application of our method to association analyses of two quantitative traits and two binary traits in the UK Biobank (genetic and phenotypic data from 500,000 residents of the UK). The associated genes we identify from single-trait enrichment tests overlap with genes known to participate in biological processes relevant to the traits. Besides trait-specific associations, our method identifies several variants with shared multi-trait associations, suggesting putative shared genetic architecture.
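To make the inflation problem concrete, the sketch below illustrates why non-independence among variants distorts GWA effect estimates. It is not the authors' neural network shrinkage. It relies on the standard result that, for standardized genotypes, the expected marginal effect of each variant is the correlation-weighted sum of the true effects of all variants. The three-variant correlation matrix `LD`, the effect vector `TRUE_BETA`, and the helper `marginal_effects` are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: shows how correlation (LD) among variants
# inflates marginal GWA effects. This is NOT the shrinkage method
# described in the abstract; all names and values are hypothetical.

def marginal_effects(ld, beta):
    """Expected marginal effects under standardized genotypes.

    Each variant's marginal effect is the sum of the true effects of
    all variants, weighted by its correlation with them (R @ beta).
    """
    return [sum(r_jk * b_k for r_jk, b_k in zip(row, beta)) for row in ld]


# Hypothetical correlation matrix: variants 0 and 1 are strongly
# correlated; variant 2 is independent of both.
LD = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Only variant 0 has a true (causal) effect.
TRUE_BETA = [0.5, 0.0, 0.0]

marginal = marginal_effects(LD, TRUE_BETA)

for j, (b, m) in enumerate(zip(TRUE_BETA, marginal)):
    print(f"variant {j}: true effect = {b:.2f}, marginal effect = {m:.2f}")
```

In this toy setting, variant 1 has no true effect yet shows a nonzero marginal effect purely through its correlation with variant 0. A shrinkage step aims to remove this kind of spurious signal before downstream clustering.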