Nonrandom missing data can bias Principal Component Analysis inference of population genetic structure.

Authors

Yi Xueling, Latch Emily K

Affiliation

Behavioral and Molecular Ecology Research Group, Department of Biological Sciences, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA.

Publication information

Mol Ecol Resour. 2022 Feb;22(2):602-611. doi: 10.1111/1755-0998.13498. Epub 2021 Sep 9.

DOI:10.1111/1755-0998.13498

PMID:34463035

Abstract

Population genetic studies in non-model systems increasingly use next-generation sequencing to obtain more loci, but such methods also generate more missing data that may affect downstream analyses. Here we focus on the principal component analysis (PCA) which has been widely used to explore and visualize population structure with mean-imputed missing data. We simulated data of different population models with various total missingness (1%, 10%, 20%) introduced either randomly or biased among individuals or populations. We found that individuals biased with missing data would be dragged away from their real population clusters to the origin of PCA plots, making them indistinguishable from true admixed individuals and potentially leading to misinterpreted population structure. We also generated empirical data of the big brown bat (Eptesicus fuscus) using restriction site-associated DNA sequencing (RADseq). We filtered three data sets with 19.12%, 9.87%, and 1.35% total missingness, all showing nonrandom missing data with biased individuals dragged towards the PCA origin, consistent with results from simulations. We highlight the importance of considering missing data effects on PCA in non-model systems where nonrandom missing data are common due to varying sample quality. To help detect missing data effects, we suggest to (1) plot PCA with a colour gradient showing per sample missingness, (2) interpret samples close to the PCA origin with extra caution, (3) explore filtering parameters with and without the missingness-biased samples, and (4) use complementary analyses (e.g., model-based methods) to cross-validate PCA results and help interpret population structure.
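The mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation code: the function names, the two-population model, and all parameter values (20 individuals per population, 300 loci, 80% missingness in three individuals, 1% elsewhere) are assumptions chosen for illustration. It uses only the Python standard library and finds the first principal component by power iteration, where a real analysis would more likely use a dedicated PCA routine. The key point is that after per-locus mean imputation and centering, every imputed genotype contributes exactly zero, so an individual with many missing genotypes is pulled towards the PCA origin.

```python
import math
import random


def simulate_genotypes(n_per_pop, n_loci, rng):
    """Two diverged populations; each genotype is an allele count 0, 1 or 2."""
    freqs = [(rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)) for _ in range(n_loci)]
    genotypes = []
    for pop in (0, 1):
        for _ in range(n_per_pop):
            genotypes.append(
                [int(rng.random() < f[pop]) + int(rng.random() < f[pop]) for f in freqs]
            )
    return genotypes


def add_missingness(genotypes, biased, rng, biased_rate=0.8, background_rate=0.01):
    """Mask genotypes as None; individuals listed in `biased` lose far more loci."""
    masked = []
    for i, row in enumerate(genotypes):
        rate = biased_rate if i in biased else background_rate
        masked.append([None if rng.random() < rate else g for g in row])
    return masked


def mean_impute_and_center(genotypes):
    """Impute each missing genotype with the locus mean, then center the locus.

    The mean of a mean-imputed column equals the observed mean, so every
    imputed entry becomes exactly 0 after centering.
    """
    n_loci = len(genotypes[0])
    centered = [[0.0] * n_loci for _ in genotypes]
    for j in range(n_loci):
        observed = [row[j] for row in genotypes if row[j] is not None]
        if not observed:
            continue  # locus missing in everyone: leave the column at zero
        mean = sum(observed) / len(observed)
        for i, row in enumerate(genotypes):
            if row[j] is not None:
                centered[i][j] = row[j] - mean
    return centered


def pc1_scores(centered, n_iter=200):
    """PC1 scores of individuals via power iteration on the Gram matrix X X^T."""
    n = len(centered)
    gram = [
        [sum(a * b for a, b in zip(centered[i], centered[k])) for k in range(n)]
        for i in range(n)
    ]
    v = [float(i + 1) for i in range(n)]  # any start not orthogonal to PC1 works
    for _ in range(n_iter):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(g * x for g, x in zip(gram[i], v)) for i in range(n)
    )
    return [x * math.sqrt(eigenvalue) for x in v]


def run_demo(seed=42):
    """Return PC1 scores, per-individual missingness, and the biased indices."""
    rng = random.Random(seed)
    genotypes = simulate_genotypes(n_per_pop=20, n_loci=300, rng=rng)
    biased = {0, 1, 2}  # three individuals from the first population
    masked = add_missingness(genotypes, biased, rng)
    scores = pc1_scores(mean_impute_and_center(masked))
    missing = [sum(g is None for g in row) / len(row) for row in masked]
    return scores, missing, biased


if __name__ == "__main__":
    scores, missing, biased = run_demo()
    for i, (score, miss) in enumerate(zip(scores, missing)):
        flag = "  <- high missingness" if i in biased else ""
        print(f"ind {i:2d}  pop {i // 20}  missing {miss:5.1%}  PC1 {score:7.2f}{flag}")
```

In the printed table, the three high-missingness individuals should sit much closer to zero on PC1 than the rest of their population, which is how they could be mistaken for admixed individuals. The per-individual `missing` values returned by `run_demo` are the quantity the authors suggest mapping onto a colour gradient in the PCA plot.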

Similar articles

Nonrandom missing data can bias Principal Component Analysis inference of population genetic structure.

Mol Ecol Resour. 2022 Feb;22(2):602-611. doi: 10.1111/1755-0998.13498. Epub 2021 Sep 9.

Robust inference of population structure from next-generation sequencing data with systematic differences in sequencing.

Bioinformatics. 2018 Apr 1;34(7):1157-1163. doi: 10.1093/bioinformatics/btx708.

Large-scale inference of population structure in presence of missingness using PCA.

Bioinformatics. 2021 Jul 27;37(13):1868-1875. doi: 10.1093/bioinformatics/btab027.

The Impact of Nonrandom Missingness in Surveillance Data for Population-Level Summaries: Simulation Study.

JMIR Public Health Surveill. 2022 Sep 9;8(9):e37887. doi: 10.2196/37887.

How do SNP ascertainment schemes and population demographics affect inferences about population history?

BMC Genomics. 2015 Apr 3;16(1):266. doi: 10.1186/s12864-015-1469-5.

RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling.

Mol Ecol. 2013 Jun;22(11):3179-90. doi: 10.1111/mec.12276. Epub 2013 Apr 3.

Missing data imputation via the expectation-maximization algorithm can improve principal component analysis aimed at deriving biomarker profiles and dietary patterns.

Nutr Res. 2020 Mar;75:67-76. doi: 10.1016/j.nutres.2020.01.001. Epub 2020 Jan 9.

Genotype-free estimation of allele frequencies reduces bias and improves demographic inference from RADSeq data.

Mol Ecol Resour. 2019 May;19(3):586-596. doi: 10.1111/1755-0998.12990. Epub 2019 Apr 17.

How "simple" methodological decisions affect interpretation of population structure based on reduced representation library DNA sequencing: A case study using the lake whitefish.

PLoS One. 2020 Jan 24;15(1):e0226608. doi: 10.1371/journal.pone.0226608. eCollection 2020.

Principal components analysis of population admixture.

PLoS One. 2012;7(7):e40115. doi: 10.1371/journal.pone.0040115. Epub 2012 Jul 9.

Cited by

Non-Random Mortality in an Experimental Oyster Restoration.

Evol Appl. 2025 Jul 6;18(7):e70128. doi: 10.1111/eva.70128. eCollection 2025 Jul.

Fine Scale Patterns of Population Structure and Connectivity in Scandinavian Flat Oysters in Scandinavia ( L.).

Evol Appl. 2025 Mar 31;18(4):e70096. doi: 10.1111/eva.70096. eCollection 2025 Apr.

Pandora: a tool to estimate dimensionality reduction stability of genotype data.

Bioinform Adv. 2025 Mar 3;5(1):vbaf040. doi: 10.1093/bioadv/vbaf040. eCollection 2025.

Sex chromosome turnover in hybridizing stickleback lineages.

Evol Lett. 2024 May 11;8(5):658-668. doi: 10.1093/evlett/qrae019. eCollection 2024 Sep.

Genotyping-by-sequencing informs conservation of Andean palms sources of non-timber forest products.

Evol Appl. 2024 Jul 31;17(8):e13765. doi: 10.1111/eva.13765. eCollection 2024 Aug.

Sampling strategies for genotyping common bean (Phaseolus vulgaris L.) Genebank accessions with DArTseq: a comparison of single plants, multiple plants, and DNA pools.

Front Plant Sci. 2024 Jul 11;15:1338332. doi: 10.3389/fpls.2024.1338332. eCollection 2024.

Unraveling the genomic landscape of wrens along western Ecuador's precipitation gradient: Insights into hybridization, isolation by distance, and isolation by the environment.

Ecol Evol. 2024 Jul 11;14(7):e11661. doi: 10.1002/ece3.11661. eCollection 2024 Jul.

A trans-oceanic flight of over 4,200 km by painted lady butterflies.

Nat Commun. 2024 Jun 25;15(1):5205. doi: 10.1038/s41467-024-49079-2.

A lack of genetic diversity and minimal adaptive evolutionary divergence in introduced Mysis shrimp after 50 years.

Evol Appl. 2024 Jan 26;17(1):e13637. doi: 10.1111/eva.13637. eCollection 2024 Jan.

Individual-based landscape genomics for conservation: An analysis pipeline.

Mol Ecol Resour. 2023 Oct 26. doi: 10.1111/1755-0998.13884.
