Scalable analysis of large multi-ancestry biobanks by leveraging sparse ancestry-adjusted sample-relatedness.

Author information

Lin Xihong, Dey Rounak, Li Xihao, Li Zilin

Affiliations

Harvard T.H. Chan School of Public Health.

Insitro, Inc.

Publication information

Res Sq. 2024 Nov 12:rs.3.rs-5343361. doi: 10.21203/rs.3.rs-5343361/v1.

Abstract

Linear mixed-effects models (LMMs) and ridge regression are commonly applied in genetic association studies to control for population structure and sample-relatedness. To control for sample-relatedness, existing methods use empirical genetic relatedness matrices (GRMs), either explicitly or conceptually. This works well in mostly homogeneous populations; in multi-ancestry heterogeneous populations, however, GRMs are confounded with population structure, which leads to inflated type I error rates, massively increased computation, and reduced power. Here, we propose FastSparseGRM, a scalable pipeline for multi-ancestry genome-wide association studies (GWAS) and whole-genome sequencing (WGS) studies. It uses a block-diagonal sparse ancestry-adjusted (BDSA) GRM to model sample-relatedness and includes ancestry PCs as fixed effects to control for population structure. It is approximately 2540/4100/54 times faster than BOLT-LMM/fast-GWA/REGENIE, respectively, for fitting the null LMM on 50,000 heterogeneous subjects. Through numerical simulations and both single-variant GWAS and rare-variant WGS analyses of five biomarkers (triglycerides, HDL, LDL, BMI, total bilirubin) on the entire UK Biobank data, we demonstrate that our approach scales to nearly half a million subjects and provides accurate p-value calibration and improved power compared with existing methods.
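
To make the phrase "block-diagonal sparse" concrete, the following is a minimal illustrative sketch and not the FastSparseGRM implementation. It assumes that ancestry-adjusted kinship estimates are already available, that pairs below a cutoff are treated as unrelated, and that the diagonal blocks are the connected components of the remaining related pairs. The abstract does not specify any of these steps; the function names, the 0.05 threshold, and the toy values are all hypothetical.

```python
def sparsify_relatedness(kinship, threshold=0.05):
    """Keep the diagonal and off-diagonal pairs at or above `threshold`.

    Returns a dict {(i, j): value} with i <= j. The threshold is an
    assumption made for illustration, not a value from the paper.
    """
    n = len(kinship)
    sparse = {}
    for i in range(n):
        sparse[(i, i)] = kinship[i][i]
        for j in range(i + 1, n):
            if kinship[i][j] >= threshold:
                sparse[(i, j)] = kinship[i][j]
    return sparse


def relatedness_blocks(sparse, n):
    """Group samples into connected components (the diagonal blocks)."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j) in sparse:
        if i != j:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    blocks = {}
    for s in range(n):
        blocks.setdefault(find(s), []).append(s)
    return sorted(blocks.values())


# Toy ancestry-adjusted kinship estimates for 5 samples (made-up numbers).
kinship = [
    [1.00, 0.50, 0.01, 0.00, 0.02],
    [0.50, 1.00, 0.00, 0.01, 0.00],
    [0.01, 0.00, 1.00, 0.25, 0.00],
    [0.00, 0.01, 0.25, 1.00, 0.01],
    [0.02, 0.00, 0.00, 0.01, 1.00],
]
sparse = sparsify_relatedness(kinship, threshold=0.05)
blocks = relatedness_blocks(sparse, n=len(kinship))
print(blocks)  # [[0, 1], [2, 3], [4]]
```

In this toy example only two related pairs survive the cutoff, so the matrix splits into two blocks of size two and one singleton. Any computation that involves the relatedness matrix can then proceed block by block, which is the intuition behind the speed-up that sparsity offers.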


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6042/11601839/c8482397b8d6/nihpp-rs5343361v1-f0001.jpg
