Department of Biology, Lund University, 22362, Lund, Sweden.
Sci Rep. 2022 Aug 29;12(1):14683. doi: 10.1038/s41598-022-14395-4.
Principal Component Analysis (PCA) is a multivariate analysis that reduces the complexity of datasets while preserving data covariance. The outcome can be visualized on colorful scatterplots, ideally with only a minimal loss of information. PCA applications, implemented in well-cited packages like EIGENSOFT and PLINK, are extensively used as the foremost analyses in population genetics and related fields (e.g., animal, plant, or medical genetics). PCA outcomes are used to shape study design, identify and characterize individuals and populations, and draw historical and ethnobiological conclusions on origins, evolution, dispersion, and relatedness. The replicability crisis in science has prompted us to evaluate whether PCA results are reliable, robust, and replicable. We analyzed twelve common test cases using an intuitive color-based model alongside human population data. We demonstrate that PCA results can be artifacts of the data and can be easily manipulated to generate desired outcomes. PCA adjustment also yielded unfavorable outcomes in association studies. PCA results may not be reliable, robust, or replicable as the field assumes. Our findings raise concerns about the validity of results reported in the population genetics literature and related fields that place a disproportionate reliance upon PCA outcomes and the insights derived from them. We conclude that PCA may have a biasing role in genetic investigations and that 32,000-216,000 genetic studies should be reevaluated. An alternative mixed-admixture population genetic model is discussed.
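The sensitivity described above can be illustrated with a toy version of a color-based model. The sketch below is not the authors' code or their exact test cases: it assumes four hypothetical "populations" (red, green, blue, black) represented as noisy points in a three-dimensional color space, projects them onto the first two principal components with a plain SVD, and compares how far each color centroid lies from black when sample sizes are even versus when one population is heavily oversampled. The function names, noise level, and sample sizes are all illustrative assumptions.

```python
import numpy as np

# Hypothetical "populations": primary colors as points in a 3-D color space.
COLORS = {
    "red": [1.0, 0.0, 0.0],
    "green": [0.0, 1.0, 0.0],
    "blue": [0.0, 0.0, 1.0],
    "black": [0.0, 0.0, 0.0],
}


def make_samples(counts, noise=0.01, seed=0):
    """Draw counts[name] noisy copies of each color (noise level is an assumption)."""
    rng = np.random.default_rng(seed)
    rows, labels = [], []
    for name, n in counts.items():
        base = np.array(COLORS[name])
        rows.append(base + rng.normal(0.0, noise, size=(n, 3)))
        labels.extend([name] * n)
    return np.vstack(rows), np.array(labels)


def pca_2d(X):
    """Project rows of X onto the first two principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # center each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T                       # coordinates on PC1 and PC2
    explained = (S ** 2) / np.sum(S ** 2)        # share of variance per component
    return scores, explained[:2]


def distances_to(scores, labels, ref="black"):
    """Distance from each population centroid to the reference centroid in PC space."""
    cents = {name: scores[labels == name].mean(axis=0) for name in np.unique(labels)}
    return {
        name: float(np.linalg.norm(c - cents[ref]))
        for name, c in cents.items()
        if name != ref
    }


designs = {
    "even": {"red": 10, "green": 10, "blue": 10, "black": 10},
    "skewed": {"red": 200, "green": 10, "blue": 10, "black": 10},
}

results = {}
for design, counts in designs.items():
    X, y = make_samples(counts)
    scores, explained = pca_2d(X)
    results[design] = (distances_to(scores, y), explained)
    print(design, results[design][0], explained.round(3))
```

In this sketch the underlying colors never change, yet the even design places red, green, and blue at roughly equal distances from black, while oversampling red pulls the first component toward the red axis and makes red appear farther from black than green or blue. The effect here is modest and depends on the assumed sample sizes; it is meant only to show the mechanism by which sampling choices alone can reshape a PCA plot.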