Scalable probabilistic PCA for large-scale genetic variation data.

Affiliations

Department of Computer Science, Indian Institute of Technology, Delhi, India.

Bioinformatics Interdepartmental Program, University of California, Los Angeles, California, United States of America.

Publication information

PLoS Genet. 2020 May 29;16(5):e1008773. doi: 10.1371/journal.pgen.1008773. eCollection 2020 May.

DOI:10.1371/journal.pgen.1008773

PMID:32469896

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7286535/

Abstract

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.
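The abstract does not spell out how a probabilistic generative model yields the top PCs, so the following is only a generic illustration: a minimal sketch of the textbook EM iteration for PCA in the zero-noise limit of probabilistic PCA. It is not the ProPCA implementation. The function name `em_pca`, the layout of `Y` (SNPs in rows, individuals in columns, rows mean-centered), and the iteration count are assumptions made for this example.

```python
import numpy as np


def em_pca(Y, k, n_iter=50, seed=0):
    """Estimate an orthonormal basis of the top-k principal subspace of Y.

    Y is assumed to be a (n_snps, n_individuals) matrix whose rows have
    already been mean-centered. This layout is an assumption of the sketch.
    """
    rng = np.random.default_rng(seed)
    # Random starting loadings, one column per latent dimension.
    C = rng.standard_normal((Y.shape[0], k))
    for _ in range(n_iter):
        # E-step: latent coordinates X = (C^T C)^{-1} C^T Y
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: updated loadings C = Y X^T (X X^T)^{-1}
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    # The columns of C span the principal subspace but are not orthonormal,
    # so orthonormalize them with a QR decomposition.
    Q, _ = np.linalg.qr(C)
    return Q


# Small synthetic example: rank-3 signal plus weak noise (made-up data).
rng = np.random.default_rng(1)
n_snps, n_ind, k = 50, 200, 3
U_true, _ = np.linalg.qr(rng.standard_normal((n_snps, k)))
Z = rng.standard_normal((k, n_ind)) * np.array([[10.0], [8.0], [6.0]])
Y = U_true @ Z + 0.1 * rng.standard_normal((n_snps, n_ind))
Y = Y - Y.mean(axis=1, keepdims=True)

Q = em_pca(Y, k)
# Projecting the data onto the estimated subspace gives one row per component.
scores = Q.T @ Y
```

Each iteration needs only products of the data matrix with thin `k`-column matrices and the solution of small `k × k` systems, which is why iterations of this kind are attractive when the number of individuals is large. The recovered basis spans the same subspace as the top `k` left singular vectors of `Y`, although the individual columns may differ by a rotation.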
