Study of large and highly stratified population datasets by combining iterative pruning principal component analysis and structure.

Affiliation

Faculty of Engineering, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand.

Publication information

BMC Bioinformatics. 2011 Jun 23;12:255. doi: 10.1186/1471-2105-12-255.

DOI: 10.1186/1471-2105-12-255

PMID: 21699684

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3148578/

Abstract

BACKGROUND

The ever-increasing size of population genetic datasets poses great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Non-parametric methods based on Principal Component Analysis (PCA) have been developed for resolving structure, and they rely on the TW test. Although PCA-based methods can resolve structure, they cannot infer ancestry. Model-based methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. It includes a new heuristic for detecting structure and incorporates the structure patterns inferred by a PCA method to complement STRUCTURE analysis.

RESULTS

A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus better suited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine dataset and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA.
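The iterative pruning idea described above can be illustrated with a minimal sketch: run PCA on a group of individuals, ask a stopping criterion whether structure is present, split the group in two if so, and repeat on each half. The abstract does not define EigenDev, so the sketch takes the criterion as a caller-supplied function; `toy_criterion` is a hypothetical placeholder and is neither EigenDev nor the TW test. The split on the sign of the first principal component is likewise a simplifying assumption, since the abstract does not state how ipPCA performs the split. All function names are illustrative and are not taken from the ipPCA software.

```python
import numpy as np


def top_pcs(genotypes, n_pcs=1):
    """Return all eigenvalues and the leading PC scores of a samples x SNPs matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    scale = centered.std(axis=0)
    scale[scale == 0] = 1.0  # leave monomorphic SNPs unscaled (they stay at zero)
    z = centered / scale
    # The SVD of the standardized matrix yields the PCA directly.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    eigenvalues = s ** 2 / (z.shape[0] - 1)
    return eigenvalues, u[:, :n_pcs] * s[:n_pcs]


def toy_criterion(eigenvalues, ratio=10.0):
    """Hypothetical placeholder (NOT EigenDev, NOT TW): top eigenvalue far above the median."""
    return eigenvalues[0] > ratio * np.median(eigenvalues)


def ippca(genotypes, has_structure, indices=None, min_size=5):
    """Recursively split individuals until has_structure() reports no structure."""
    if indices is None:
        indices = np.arange(genotypes.shape[0])
    if len(indices) < 2 * min_size:
        return [indices]
    eigenvalues, scores = top_pcs(genotypes[indices], n_pcs=1)
    if not has_structure(eigenvalues):
        return [indices]
    # Assumption: split on the sign of PC1 instead of a proper clustering step.
    left = indices[scores[:, 0] < 0]
    right = indices[scores[:, 0] >= 0]
    if len(left) < min_size or len(right) < min_size:
        return [indices]
    return (ippca(genotypes, has_structure, left, min_size)
            + ippca(genotypes, has_structure, right, min_size))


# Simulated example: two populations of 30 individuals with very different
# allele frequencies at 200 SNPs (genotypes coded as 0, 1 or 2).
rng = np.random.default_rng(0)
pop_a = rng.binomial(2, 0.1, size=(30, 200))
pop_b = rng.binomial(2, 0.9, size=(30, 200))
genotypes = np.vstack([pop_a, pop_b])

clusters = ippca(genotypes, toy_criterion)
print(len(clusters), [len(c) for c in clusters])
```

Because the stopping rule is passed in as `has_structure`, a different test can be substituted without changing the recursion. The choice of that rule is what the abstract identifies as critical for large samples.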

CONCLUSIONS

The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca.
