Scalable probabilistic PCA for large-scale genetic variation data.

Affiliations

Department of Computer Science, Indian Institute of Technology, Delhi, India.

Bioinformatics Interdepartmental Program, University of California, Los Angeles, California, United States of America.

Publication information

PLoS Genet. 2020 May 29;16(5):e1008773. doi: 10.1371/journal.pgen.1008773. eCollection 2020 May.

Abstract

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.
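
The abstract does not spell out the algorithm behind ProPCA, so the following is only a minimal sketch of the general idea, assuming the classical expectation-maximization (EM) iteration for probabilistic PCA in the zero-noise limit. The function name `em_pca` and its parameters `n_iter` and `seed` are hypothetical, and none of the optimizations that would be needed for biobank-scale genotype matrices are included.

```python
import numpy as np

def em_pca(Y, k, n_iter=200, seed=0):
    """Approximate the top-k principal axes of Y (n_samples x n_features).

    Illustrative sketch only: plain EM for probabilistic PCA in the
    zero-noise limit, not the ProPCA implementation.
    """
    rng = np.random.default_rng(seed)
    Yc = Y - Y.mean(axis=0)            # center each feature (e.g. each SNP)
    m = Yc.shape[1]
    C = rng.standard_normal((m, k))    # random initial loadings, m x k

    for _ in range(n_iter):
        # E-step: latent coordinates X = Yc C (C^T C)^{-1}, shape n x k
        X = np.linalg.solve(C.T @ C, (Yc @ C).T).T
        # M-step: loadings C = Yc^T X (X^T X)^{-1}, shape m x k
        C = np.linalg.solve(X.T @ X, (Yc.T @ X).T).T

    # EM only identifies the principal subspace, so orthonormalize it
    # and rotate within it to recover individually ordered PCs.
    Q, _ = np.linalg.qr(C)
    _, _, Vt = np.linalg.svd(Yc @ Q, full_matrices=False)
    return Q @ Vt.T                    # m x k, orthonormal columns
```

Each iteration touches the data only through two matrix products with a thin `k`-column matrix, which is why this family of methods is attractive when only a handful of PCs are needed; per-individual PC scores can then be obtained by projecting the centered data onto the returned axes.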
